Module 01 — Async Rust Fundamentals
Track: Foundation — Mission Control Platform
Position: Module 1 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1, 2, 7
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Async Telemetry Ingestion Broker
- Prerequisites
- What Comes Next
Mission Context
Meridian's legacy Python control plane was built for a 6-satellite constellation. It handles ground station connections sequentially: one connection at a time, blocking on each telemetry frame before moving to the next. At 48 satellites across 12 ground station sites, this architecture is the primary bottleneck in the control plane. During peak pass windows, the broker accumulates up to 40 seconds of delivery lag — unacceptable for conjunction avoidance workflows that require sub-10-second frame delivery.
This module establishes the async Rust foundation for the replacement system. Before writing any production control plane code, you need an accurate model of how async Rust executes — not at the surface level of #[tokio::main], but at the level of futures, the polling contract, executor scheduling, and task lifecycle. Every architectural decision in the modules that follow depends on this model.
What You Will Learn
By the end of this module you will be able to:
- Implement the Future trait directly and trace the polling lifecycle from first call to completion
- Explain the waker contract and identify futures that will silently stall due to missing waker registration
- Distinguish between tokio::spawn, tokio::join!, and tokio::select! and apply each to the correct concurrency pattern
- Configure a Tokio runtime explicitly via Builder, size worker and blocking thread pools for a given workload profile, and isolate high-frequency I/O workloads from blocking work
- Cancel tasks safely using .abort() and tokio::time::timeout, understanding exactly where and when the future is dropped
- Implement a graceful shutdown sequence with a bounded drain deadline, using RAII and CancellationToken for async cleanup
Lessons
Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop
Covers the Future trait, the poll function, Poll::Ready vs Poll::Pending, the waker contract, Pin, and the executor's task queue. Includes a manually implemented future to make the state machine mechanics explicit.
Key question this lesson answers: What actually happens at every await point, and what causes a task to silently stall?
→ lesson-01-async-await-model.md / lesson-01-quiz.toml
Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools
Covers Tokio's multi-thread work-stealing scheduler, the distinction between worker threads and blocking threads, tokio::task::spawn_blocking, and explicit runtime configuration via Builder. Includes a dual-runtime pattern for isolating ingress and housekeeping workloads.
Key question this lesson answers: How do you configure the runtime for your actual workload rather than the defaults, and when does blocking work need to leave the async executor?
→ lesson-02-tokio-runtime.md / lesson-02-quiz.toml
Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management
Covers JoinHandle<T> semantics, cooperative cancellation with .abort(), RAII cleanup on cancellation, tokio::time::timeout, tokio::select! for racing futures, and a complete graceful shutdown pattern using broadcast and a bounded drain deadline.
Key question this lesson answers: How do you cleanly terminate a task — whether it completes normally, times out, or receives a shutdown signal — without leaking resources or corrupting state?
→ lesson-03-task-lifecycle.md / lesson-03-quiz.toml
Capstone Project — Async Telemetry Ingestion Broker
Build the TCP ingress layer for Meridian's replacement control plane. The broker accepts concurrent connections from ground stations, reads length-prefixed telemetry frames, fans each frame out to multiple downstream handlers via a broadcast channel, and shuts down gracefully on Ctrl-C with a 10-second drain deadline.
Acceptance is against 7 verifiable criteria including concurrent connection handling, broadcast fan-out correctness, slow-handler isolation, and bounded graceful shutdown.
→ project-async-telemetry-broker.md
Prerequisites
This module assumes you are comfortable with Rust ownership, borrowing, traits, and closures. It does not re-explain language fundamentals. If you are new to async Rust generally, the module starts from first principles at the trait level — but it expects you to read Rust error messages without assistance.
What Comes Next
Module 2 — Concurrency Primitives builds directly on this foundation: now that you understand how the executor runs tasks, Module 2 covers how those tasks share state safely — Mutex, RwLock, atomics, and memory ordering. The ground station command queue project in Module 2 connects directly to the telemetry broker you build here.
Lesson 1 — The async/await Model: Futures, Polling, and the Executor Loop
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 1–2
Context
Meridian's legacy Python control plane was designed for a 6-satellite constellation. It handles ground station connections sequentially: accept a connection, process its telemetry frame, move to the next connection. At 6 satellites, this was acceptable. At 48 satellites across 12 ground station sites, it is a bottleneck. A single slow uplink from a station in the Atacama Desert holds up frames from every other active connection. The Python GIL makes true parallelism on this I/O-bound workload impossible without forking processes, which multiplies memory overhead and complicates shared state.
The replacement control plane is being written in Rust with tokio. Before writing any of that system, you need an accurate mental model of how async Rust actually executes code — not at the level of the tokio macro, but at the level of futures, the polling protocol, and the executor's task queue. Misunderstanding this model is the root cause of most async Rust bugs in production: dropped wakers, blocking the executor thread, and state machine explosions that are impossible to reason about.
This lesson covers the mechanics that every await desugars into. By the end, you should be able to read a future's poll implementation and trace exactly when it will make progress, when it will yield, and what will wake it back up. That skill is indispensable when debugging a hung ground station connection at 0300.
Source: Async Rust, Chapters 1–2 (Flitton & Morton)
Core Concepts
What Async Actually Is
Async programming does not add CPU cores. It reorganizes work so that dead time — waiting for a network response, waiting for a disk write — is used to make progress on other tasks. The classic analogy: you do not stand still while the kettle boils. You put the bread in the toaster. The key insight is that both tasks share one pair of hands but interleave their execution during wait periods.
In Rust, this interleaving is explicit and zero-cost. There is no runtime scheduler running on a background OS thread intercepting your code. Instead, you write state machines, and the Rust compiler compiles async fn into those state machines for you. await is a yield point — a place where the current task volunteers to give up the thread so another task can run.
This is the critical difference from threads: with threads, preemption is involuntary. With async tasks, yield is voluntary, at every await. A task that never hits an await — one that runs a tight CPU loop — will starve every other task on that executor thread. This is not hypothetical. In Meridian's uplink pipeline, a single malformed frame that triggers O(n²) validation holds the entire thread if there's no await in the hot path.
The Future Trait
Every value produced by an async fn or an async block implements Future. The trait is:
```rust
pub trait Future {
    type Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
```
Poll has two variants: Poll::Ready(value) when the computation is complete, and Poll::Pending when the future cannot yet make progress and should be woken up later.
The poll function is not async. This matters: futures are polled synchronously. The executor calls poll; the future runs synchronously until it either completes or reaches a point where it cannot proceed, and then poll returns. If it returns Pending, it is the future's responsibility to arrange for a wake-up. If it returns Pending without registering a waker, the task will never run again — a silent deadlock.
The Waker Contract
Context carries a Waker. The Waker is a handle that, when called, schedules the associated task back onto the executor's run queue. The contract is: if poll returns Pending, it must have called cx.waker().wake_by_ref() or stored the waker to be called later when the awaited resource becomes available.
Violating this contract — returning Pending without registering the waker — produces a future that stalls forever with no error. The executor sees a pending task, never reschedules it, and the task silently vanishes from the run queue. At the Meridian scale, this manifests as a ground station connection that goes quiet mid-session: no error, no disconnect, just silence until the session timeout fires.
The executor side of this contract: when a waker is called, the executor re-queues the task and eventually calls poll again. The future may be polled many times before it completes. The state it needs to resume must be owned by the future struct itself — this is why async Rust desugars async fn into a struct that holds all local variables as fields.
Pinning
The poll signature takes Pin<&mut Self> rather than &mut Self. Pin prevents the future from being moved in memory after it has been pinned. This matters because async state machines frequently contain self-referential structures: a future that awaits another future may hold a reference into its own fields. If the outer future were moved, that reference would dangle.
Pin enforces at compile time that once you call poll, the future cannot be moved. For futures composed entirely of Unpin types (most standard types), this is a no-op. For futures holding references into themselves — which the compiler generates automatically from async fn — it is essential.
Practical implication: you cannot call poll directly on a future obtained from an async fn without first pinning it via Box::pin(future) or tokio::pin!(future). tokio::spawn handles this for you; you only encounter it directly when building custom executors or when polling a future by reference inside select!.
tokio::pin! — Polling a Future by Reference
tokio::pin! pins a value to the current stack frame in place, making it safe to poll by mutable reference. The common situation where this matters: you need to start an async operation once and track its progress across multiple iterations of a select! loop, rather than restarting it fresh on every iteration.
Consider fetching a TLE catalog update while simultaneously processing incoming session commands. The fetch should run to completion in the background; the command loop should not restart it on each iteration:
```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn fetch_tle_update() -> Vec<u8> {
    // Simulate a slow catalog fetch — ~200ms in production.
    sleep(Duration::from_millis(200)).await;
    vec![0u8; 64] // placeholder TLE payload
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(8);

    // Spawn a sender to simulate incoming commands.
    tokio::spawn(async move {
        for cmd in ["REPOINT", "STATUS", "RESET"] {
            sleep(Duration::from_millis(60)).await;
            let _ = tx.send(cmd.to_string()).await;
        }
    });

    // Create the future ONCE, outside the loop.
    let tle_fetch = fetch_tle_update();
    // Pin it to the stack so we can poll it by reference (&mut tle_fetch).
    tokio::pin!(tle_fetch);

    let mut tle_done = false;
    loop {
        tokio::select! {
            // Poll the same future instance each iteration.
            // Without tokio::pin!, each iteration would call fetch_tle_update()
            // again, creating a brand-new future and discarding all progress.
            tle = &mut tle_fetch, if !tle_done => {
                println!("TLE update received: {} bytes", tle.len());
                tle_done = true;
            }
            Some(cmd) = rx.recv() => {
                println!("command received: {cmd}");
                if cmd == "RESET" {
                    break;
                }
            }
            else => break,
        }
    }
}
```
Two things to notice. First, tle_fetch is created before the loop and pinned with tokio::pin!. Inside select!, &mut tle_fetch polls the same future on every iteration — it accumulates progress across polls. If you wrote fetch_tle_update() directly inside select!, you would get a new future each time and the fetch would restart from zero on every loop iteration.
Second, the , if !tle_done precondition disables the branch once the fetch has completed. This is essential: if the branch stays enabled after the future resolves, select! would attempt to poll an already-completed future on the next iteration, causing an "async fn resumed after completion" panic. The precondition guards against this. Lesson 3 returns to select! and its branch preconditions in more depth.
The Executor Loop
The executor maintains a run queue of tasks ready to be polled. Its loop is approximately:
- Pop a task from the ready queue.
- Call poll on it.
- If Poll::Ready, the task is done — drop it.
- If Poll::Pending, the task is parked. It will be re-queued only when its waker is called.
Tasks are not re-polled speculatively. They are polled exactly when woken. This means a task can sit in Pending state indefinitely if nothing triggers its waker — which is the correct behavior for a task waiting on a network connection that has gone silent.
tokio::spawn places a task on the executor's ready queue. tokio::join! polls multiple futures concurrently on the same task — no new OS threads, no new tasks, just interleaved polling within the same scheduler slot. tokio::spawn creates a new independent task that can be scheduled on any worker thread.
Code Examples
Implementing Future Directly — A Telemetry Frame Validator
This example implements Future manually to illustrate what async/await desugars into. In production this would be an async fn, but seeing the state machine explicitly clarifies exactly when control yields and what triggers resumption.
The scenario: validating an incoming telemetry frame header requires checking a CRC that is computed in a background thread pool. The future polls a oneshot channel for the result.
```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::sync::oneshot;

/// Represents a frame whose header CRC is being validated asynchronously.
/// The validation runs on a blocking thread; this future waits for its result.
pub struct FrameValidationFuture {
    // oneshot::Receiver implements Future directly, but we wrap it here
    // to show the polling mechanics explicitly.
    receiver: oneshot::Receiver<bool>,
}

impl FrameValidationFuture {
    pub fn new(receiver: oneshot::Receiver<bool>) -> Self {
        Self { receiver }
    }
}

impl Future for FrameValidationFuture {
    type Output = Result<(), String>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Pin::new is safe here because oneshot::Receiver is Unpin.
        // For a self-referential type we'd need unsafe or box-pinning.
        match Pin::new(&mut self.receiver).poll(cx) {
            Poll::Ready(Ok(true)) => Poll::Ready(Ok(())),
            Poll::Ready(Ok(false)) => {
                Poll::Ready(Err("CRC validation failed".to_string()))
            }
            Poll::Ready(Err(_)) => {
                // Sender dropped without sending — the validator thread panicked
                // or was cancelled. Treat as a validation failure, not a panic.
                Poll::Ready(Err("Validator thread terminated unexpectedly".to_string()))
            }
            // The result is not ready yet. The oneshot::Receiver has already
            // registered cx's waker — it will call it when a value is sent.
            // We return Pending; the executor parks this task.
            Poll::Pending => Poll::Pending,
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel::<bool>();

    // Simulate the CRC validator running on a blocking thread pool.
    tokio::spawn(async move {
        // In production: tokio::task::spawn_blocking(|| compute_crc(...)).await
        // Here we just send a valid result immediately.
        let _ = tx.send(true);
    });

    let validation = FrameValidationFuture::new(rx);
    match validation.await {
        Ok(()) => println!("Frame header valid — forwarding to telemetry pipeline"),
        Err(e) => eprintln!("Frame rejected: {e}"),
    }
}
```
The poll implementation delegates to the inner oneshot::Receiver's own poll. When Receiver::poll returns Pending, it has already stored the waker from cx internally. When tx.send(true) fires, Receiver calls that waker, which re-queues this task. No manual waker management is needed here because we compose with a type that already handles it correctly.
This is the pattern to follow when building custom futures: compose with existing futures and channel primitives wherever possible. Write unsafe waker code only when you are bridging to a non-async notification source (an epoll fd, a hardware interrupt, a C library callback).
Concurrent Polling with tokio::join! vs. Sequential await
Sequential await for two telemetry frame fetches from different ground stations means the second fetch does not start until the first completes:
```rust
// SEQUENTIAL — total latency = latency(station_a) + latency(station_b)
let frame_a = fetch_frame("gs-atacama").await?;
let frame_b = fetch_frame("gs-svalbard").await?;
```
tokio::join! polls both concurrently on the same task. While one is pending, the executor can drive the other forward:
```rust
use anyhow::Result;
use tokio::net::TcpStream;

async fn fetch_frame(station_id: &str) -> Result<Vec<u8>> {
    // Simplified: in production this reads from a persistent connection pool.
    let mut _stream = TcpStream::connect(format!("{station_id}:7777")).await?;
    // ... read frame bytes ...
    Ok(vec![]) // placeholder
}

#[tokio::main]
async fn main() -> Result<()> {
    // CONCURRENT — total latency ≈ max(latency_a, latency_b)
    // Both futures are polled in the same task; no new OS threads are created.
    let (frame_a, frame_b) = tokio::join!(
        fetch_frame("gs-atacama"),
        fetch_frame("gs-svalbard")
    );

    // Both results are available here; handle errors independently.
    match (frame_a, frame_b) {
        (Ok(a), Ok(b)) => {
            println!("Received {} + {} bytes from ground stations", a.len(), b.len());
        }
        (Err(e), _) | (_, Err(e)) => {
            eprintln!("Ground station fetch failed: {e}");
        }
    }
    Ok(())
}
```
tokio::join! is appropriate when the futures are independent and you need both results. If you only need the first result and want to cancel the loser, use tokio::select!. If the futures have no data dependency and need to run across multiple threads simultaneously, tokio::spawn each and join the handles.
Key Takeaways
- The Future trait's poll method is synchronous. An async runtime is a loop that calls poll on ready tasks; it does not preempt running tasks. A future that does significant CPU work without an await will monopolize its executor thread.
- If poll returns Poll::Pending without registering the context's waker, the task is silently parked forever. Always verify that the resource you're awaiting will call the waker when it becomes available.
- Pin<&mut Self> exists to prevent futures from being moved after polling begins. For futures containing self-referential state (which the compiler generates automatically), this is load-bearing. Most composed futures are Unpin; the constraint only bites when bridging to raw async primitives.
- tokio::join! achieves concurrency within a single task by interleaved polling. It does not create threads or new tasks. Use it for independent I/O operations that should proceed simultaneously but whose results you need together.
- tokio::pin! pins a future to the current stack frame so it can be polled by mutable reference across multiple select! iterations. Use it when you need to start an operation once and track its progress, not restart it on each loop. Always pair it with a precondition (, if !done) to prevent polling the future after it has already resolved.
- Every async fn is compiled into a state machine struct. Variables that live across await points become fields of that struct. Understanding this explains why async Rust futures can be large, why they must be pinned, and why capturing large values across await points inflates memory use.
Lesson 2 — The Tokio Runtime: Spawning Tasks, the Scheduler, and Thread Pools
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 7
Context
The Meridian control plane receives telemetry from 48 satellite uplinks simultaneously. Each uplink connection is long-lived: a ground station holds a persistent TCP session with the control plane and streams frames at irregular intervals driven by orbital geometry and antenna alignment. Alongside these connections, the control plane runs housekeeping tasks — session health checks, TLE refresh from the catalog, and periodic flush of buffered frames to the downstream aggregator.
The #[tokio::main] macro stands up a default multi-threaded runtime and runs your async main function in it. For prototyping and simple services, this is sufficient. For a system with the throughput profile and operational requirements of Meridian's control plane, you need to understand what that runtime is actually doing — how many threads it allocates, how it distributes work across them, what happens when a blocking operation enters the mix, and how to configure it for your actual workload rather than the defaults.
This lesson covers Tokio's scheduler architecture, the distinction between async tasks and blocking tasks, how to size thread pools for I/O-bound vs. compute-bound work, and how to configure the runtime explicitly via Builder. The goal is not to tune prematurely — it is to understand the model well enough to make deliberate choices rather than accepting defaults that may be wrong for your system.
Source: Async Rust, Chapter 7 (Flitton & Morton)
Core Concepts
The Multi-Thread Scheduler
Tokio's default multi_thread scheduler maintains a pool of worker threads — by default, one per logical CPU core. Each worker thread has a local run queue. Tasks spawned with tokio::spawn go onto a global run queue and are stolen by whichever worker thread is idle. This work-stealing design keeps all workers busy when there is backlog, at the cost of some cross-thread synchronization.
Each worker runs the same loop from Lesson 1: pop a ready task, call poll, re-queue it if it returns Pending, drop it if Ready. When a worker's local queue is empty, it attempts to steal tasks from other workers' queues before checking the global queue. The global_queue_interval configuration controls how many local-queue tasks a worker processes before checking the global queue — the default is 61. Lowering this value gives newly spawned tasks lower latency at the cost of more global-queue contention.
The current_thread runtime (the default for #[tokio::test], and selected explicitly via #[tokio::main(flavor = "current_thread")]) runs all tasks on the calling thread. It is appropriate for services that are purely I/O-bound, have no CPU-intensive tasks, and for which single-threaded throughput is sufficient. The Meridian control plane uses the multi-thread runtime.
Worker Threads and Blocking Threads
Tokio distinguishes between two kinds of threads:
Worker threads run the async executor loop. They poll futures and run async task code. There should be enough of them to saturate your I/O capacity without exceeding your core count. A typical production setting is num_cpus::get(), which Builder::new_multi_thread() uses by default.
Blocking threads are spawned on demand by tokio::task::spawn_blocking. They run synchronous, blocking code — file I/O, CPU-intensive computation, synchronous library calls — in a separate thread pool that does not interfere with the async worker threads. The key rule: never perform blocking I/O or long CPU work directly on an async worker thread. Doing so parks that thread for the duration, reducing effective parallelism and starving other tasks.
max_blocking_threads caps the number of blocking threads that can exist simultaneously. The default is 512. For the Meridian control plane, which may process TLE bulk imports concurrently with live uplinks, sizing this correctly prevents runaway thread creation under load spikes.
tokio::spawn and Task Identity
tokio::spawn places a new task onto the runtime's global queue. It returns a JoinHandle<T> — a handle to the spawned task's eventual output. The task is independent of the spawner: if the spawner drops the handle, the task continues running (though its output is lost). If you need the task's output, keep the handle and .await it. If you need to cancel the task, call .abort() on the handle.
Spawned tasks must be 'static — they cannot borrow from the spawning scope. If the task needs data from the spawner, move it in with async move { ... }, clone cheaply clonable data (like Arc-wrapped state), or use channels to communicate.
A common mistake is spawning a task per connection without any admission control. At 48 uplinks with 100 frames per second each, that is 4,800 task-spawns per second for frame processing alone. Task creation has overhead. For Meridian's frame processing workload, a bounded task pool or a pipeline of fixed workers is more appropriate than an unbounded spawn-per-frame pattern.
Configuring the Runtime Explicitly
The #[tokio::main] macro is shorthand for building a default runtime and blocking on the async main function. Replacing it with an explicit Builder gives fine-grained control:
```rust
use tokio::runtime::Builder;

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(8)
        .max_blocking_threads(16)
        .thread_name("meridian-worker")
        .thread_stack_size(2 * 1024 * 1024)
        .enable_all()
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async_main());
}
```
The runtime is a value. You can have multiple runtimes in the same process — useful when you need strict resource isolation between subsystems (e.g., keeping the telemetry ingress runtime separate from the housekeeping runtime so a housekeeping spike does not starve active uplinks).
Code Examples
Explicit Runtime Configuration for the Meridian Control Plane
Meridian's control plane has two distinct workload profiles that benefit from isolated runtimes: the high-frequency telemetry ingress path (many short-lived I/O tasks) and the housekeeping path (fewer, slower tasks including blocking TLE catalog refreshes). Sharing a single runtime risks head-of-line blocking when a TLE import saturates the blocking thread pool.
```rust
use std::sync::LazyLock;

use tokio::runtime::{Builder, Runtime};

// Ingress runtime: tuned for concurrent I/O — one worker per core,
// minimal blocking threads since real blocking work routes to the
// housekeeping runtime.
static INGRESS_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(num_cpus::get())
        .max_blocking_threads(4)
        .thread_name("meridian-ingress")
        .thread_stack_size(2 * 1024 * 1024)
        .on_thread_start(|| tracing::debug!("ingress worker started"))
        .enable_all()
        .build()
        .expect("failed to build ingress runtime")
});

// Housekeeping runtime: fewer workers, more blocking threads for
// catalog refreshes and frame archival.
static HOUSEKEEPING_RUNTIME: LazyLock<Runtime> = LazyLock::new(|| {
    Builder::new_multi_thread()
        .worker_threads(2)
        .max_blocking_threads(32)
        .thread_name("meridian-housekeeping")
        .enable_all()
        .build()
        .expect("failed to build housekeeping runtime")
});

async fn handle_uplink_session() {
    // This runs on an ingress worker thread.
    // Long-running I/O awaits are fine here.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tracing::info!("uplink session processed");
}

async fn refresh_tle_catalog() {
    // CPU + blocking I/O — route to spawn_blocking so we do not
    // park an ingress worker for the duration of the refresh.
    tokio::task::spawn_blocking(|| {
        // Synchronous HTTP fetch + file write; blocks for ~200ms.
        tracing::info!("TLE catalog refreshed");
    })
    .await
    .expect("TLE refresh panicked");
}

fn main() {
    // Ingress and housekeeping run in separate thread pools.
    // A TLE refresh spike cannot starve active uplink sessions.
    std::thread::spawn(|| {
        HOUSEKEEPING_RUNTIME.block_on(async {
            loop {
                refresh_tle_catalog().await;
                tokio::time::sleep(tokio::time::Duration::from_secs(300)).await;
            }
        });
    });

    INGRESS_RUNTIME.block_on(async {
        // In production: bind TCP listener, accept connections,
        // spawn handle_uplink_session per connection.
        for _ in 0..48 {
            INGRESS_RUNTIME.spawn(handle_uplink_session());
        }
        tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    });
}
```
The on_thread_start hook enables per-thread tracing setup. In a production deployment, this is where you would initialize thread-local metrics recorders. The thread_name setting surfaces in top, htop, and perf output — essential when profiling which runtime is responsible for CPU usage.
Dispatching Blocking Work Correctly
The most common async-correctness mistake in production Rust services is calling blocking code on an async worker thread. The rule is simple but frequently violated: if a function does not have async in its signature and it does any I/O or takes more than a few hundred microseconds, it belongs in spawn_blocking.
```rust
use anyhow::Result;
use tokio::task;

/// Parse and validate a TLE record from a raw string.
/// TLE parsing is synchronous and O(n) with input length.
/// On a 100KB batch, this can take several milliseconds.
fn parse_tle_batch_blocking(raw: String) -> Result<Vec<String>> {
    // Synchronous parsing — no I/O, but CPU-bound for large inputs.
    raw.lines()
        .filter(|l| l.starts_with("1 ") || l.starts_with("2 "))
        .map(|l| Ok(l.to_string()))
        .collect()
}

async fn ingest_tle_update(raw_batch: String) -> Result<Vec<String>> {
    // Moving raw_batch into spawn_blocking satisfies the 'static bound.
    // The closure executes on a blocking thread; we await the JoinHandle.
    let records = task::spawn_blocking(move || parse_tle_batch_blocking(raw_batch))
        .await
        // The outer error is a JoinError (task panicked or was aborted).
        // Propagate it as an application error.
        .map_err(|e| anyhow::anyhow!("TLE parser panicked: {e}"))??;
    Ok(records)
}

#[tokio::main]
async fn main() -> Result<()> {
    let raw = "1 25544U 98067A 21275.52500000 .00001234 00000-0 12345-4 0 9999\n\
               2 25544 51.6400 337.6640 0007417 62.6000 297.5200 15.48889583300000\n"
        .to_string();

    let records = ingest_tle_update(raw).await?;
    println!("Parsed {} TLE records", records.len());
    Ok(())
}
```
The double ? on .await.map_err(...)?? deserves explanation: spawn_blocking returns Result<T, JoinError>, and parse_tle_batch_blocking itself returns Result<Vec<String>, anyhow::Error>. The first ? propagates JoinError (after mapping it), the second propagates the inner application error. Collapsing these correctly is a common stumbling point — do not use .unwrap() on JoinHandle in production code; a parser panic should not take down the ingress runtime.
Key Takeaways
- Tokio's multi-thread scheduler uses work-stealing across a pool of worker threads (defaulting to one per logical CPU). Tasks spawned via tokio::spawn enter the global queue and are picked up by idle workers.
- Worker threads and blocking threads serve different purposes. Never run synchronous blocking I/O or long CPU computation on a worker thread. Use tokio::task::spawn_blocking to route blocking work to the dedicated blocking thread pool.
- Explicit Builder configuration lets you control thread counts, stack sizes, thread names, and lifecycle hooks. Use it in production to separate high-frequency I/O workloads from lower-frequency blocking workloads, preventing one from starving the other.
- tokio::spawn creates a task with 'static lifetime. If you need to share data from the spawning scope, move it into the closure with async move, wrap it in Arc, or communicate via channels.
- Multiple runtimes in the same process are a valid pattern for resource isolation. Ingress and housekeeping workloads with fundamentally different resource profiles benefit from separate thread pools rather than competing on a shared executor.
Lesson 3 — Task Lifecycle: Cancellation, Timeouts, and JoinHandle Management
Module: Foundation — M01: Async Rust Fundamentals
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 2, 7
Context
The Meridian control plane manages connections that span orbital passes. A ground station connection is live while the target satellite is above the horizon — typically 8 to 12 minutes. When the pass ends, the connection should be torn down cleanly: in-flight frames flushed, session state persisted, downstream consumers notified. If the control plane is restarted mid-pass — rolling deploy, crash recovery, OOM kill — active tasks must be cancelled in a way that does not corrupt shared state or leave downstream systems with partial data.
Understanding task lifecycle is not optional for this system. Tasks that outlive their useful scope waste resources. Tasks cancelled without cleanup leave corrupted state. Tasks that silently swallow their errors make incident response a guessing game. The Tokio JoinHandle, the .abort() call, and tokio::time::timeout are the instruments for managing these concerns; this lesson covers each one in depth.
Source: Async Rust, Chapters 2 & 7 (Flitton & Morton)
Core Concepts
The JoinHandle and Task Output
tokio::spawn returns JoinHandle<T>. The handle has two primary uses: waiting for the task's output with .await, and cancelling the task with .abort().
.await on a JoinHandle<T> produces Result<T, JoinError>. JoinError indicates one of two things: the task panicked, or it was aborted. Distinguishing them matters:
#![allow(unused)] fn main() { match handle.await { Ok(value) => { /* normal completion */ } Err(e) if e.is_panic() => { /* task panicked — log and recover */ } Err(e) if e.is_cancelled() => { /* task was aborted */ } Err(_) => unreachable!(), } }
If you drop a JoinHandle without awaiting it, the task continues running — it is not cancelled. This is the correct behavior for fire-and-forget tasks. If you need the task to stop when the handle is dropped, use tokio_util::task::AbortOnDropHandle (a wrapper that calls .abort() on drop) or implement the same pattern manually.
Task Cancellation with .abort()
.abort() sends a cancellation signal to the task. The task does not stop immediately — it is cancelled at the next .await point. This is cooperative cancellation: the task's state machine is dropped when it next yields to the executor, which runs the Drop implementation of any held values.
The implication: resources guarded by RAII are dropped correctly on cancellation. A tokio::net::TcpStream held by the task will be closed. A MutexGuard will be released. A tokio::fs::File will be closed, but note that dropping it does not guarantee buffered writes are flushed — call flush() explicitly before an .await if the data matters. What is not guaranteed is any code after the .await where cancellation occurred: it simply never runs. If you have cleanup logic that must run regardless of cancellation, it must be in a Drop impl, not in code that follows an .await.
#![allow(unused)] fn main() { // This cleanup logic may NOT run if the task is cancelled at the await: async fn session_handler(id: u64) { process_frames().await; // <-- task may be cancelled here // The following line may never execute if aborted above. persist_session_state(id).await; // NOT guaranteed on cancellation } // This cleanup logic WILL run on cancellation because it is in Drop: struct Session { id: u64, state: SessionState, } impl Drop for Session { fn drop(&mut self) { // Synchronous cleanup only — no async here. // Flush to a synchronous in-memory buffer; a separate housekeeping // task drains the buffer to persistent storage. tracing::info!(session_id = self.id, "session dropped, state buffered"); } } }
CancellationToken and TaskTracker
broadcast and watch channels work for shutdown signalling, but tokio-util provides two purpose-built primitives that are cleaner for the specific problem of cooperative shutdown.
CancellationToken is a cloneable, shareable cancellation handle. Any clone of a token represents the same cancellation event: when .cancel() is called on any one of them, all clones see it. Tasks wait on .cancelled(), which returns a future that resolves when the token is cancelled:
use tokio::time::{sleep, Duration}; use tokio_util::sync::CancellationToken; async fn uplink_session(station_id: u32, token: CancellationToken) { loop { tokio::select! { // cancelled() is just a future — it composes naturally with select!. _ = token.cancelled() => { // Run async cleanup here before returning. // This is the key advantage over .abort(): we choose when to stop // and can flush state, send final messages, close connections. tracing::info!(station_id, "session received cancellation — draining"); flush_pending_frames().await; break; } _ = process_next_frame(station_id) => { // Normal frame processing continues until token is cancelled. } } } } async fn flush_pending_frames() { sleep(Duration::from_millis(10)).await; // placeholder } async fn process_next_frame(_id: u32) { sleep(Duration::from_millis(50)).await; // placeholder } #[tokio::main] async fn main() { let token = CancellationToken::new(); // Clone the token for each task — all clones share the same cancellation. let handles: Vec<_> = (0..4) .map(|id| { let t = token.clone(); tokio::spawn(uplink_session(id, t)) }) .collect(); // Simulate running for a short time then shutting down. sleep(Duration::from_millis(120)).await; // Cancel all sessions simultaneously with one call. token.cancel(); for handle in handles { let _ = handle.await; } tracing::info!("all sessions shut down"); }
The critical difference from .abort(): when the token fires, the task's select! arm runs, giving the task the opportunity to execute async cleanup before it exits. .abort() drops the future at the next .await with no opportunity for the task to run any further code.
CancellationToken::child_token() creates a child that is cancelled when the parent is cancelled, but can also be cancelled independently. Use this for hierarchical shutdown: cancel the top-level token to shut down everything, or cancel a child token to shut down one subsystem while leaving others running.
TaskTracker solves the drain-waiting problem more cleanly than collecting JoinHandles into a Vec. Spawn tasks through the tracker; call .close() when no more tasks will be added; then .wait() to block until all tracked tasks finish:
use tokio::time::{sleep, Duration}; use tokio_util::task::TaskTracker; #[tokio::main] async fn main() { let tracker = TaskTracker::new(); let token = tokio_util::sync::CancellationToken::new(); for station_id in 0..12u32 { let t = token.clone(); tracker.spawn(async move { tokio::select! { _ = t.cancelled() => { tracing::info!(station_id, "session shutting down"); } _ = sleep(Duration::from_secs(300)) => { tracing::info!(station_id, "session pass complete"); } } }); } // Signal that no more tasks will be spawned. // wait() will not resolve until close() has been called. tracker.close(); // Trigger shutdown. sleep(Duration::from_millis(50)).await; token.cancel(); // Block until all 12 sessions finish their cleanup. tracker.wait().await; tracing::info!("all sessions drained"); }
tracker.wait() only resolves after both conditions are true: all spawned tasks have completed, and tracker.close() has been called. The close() requirement prevents a race where wait() resolves between the last task finishing and the next one being spawned. Always call close() before wait().
tokio::time::timeout
tokio::time::timeout(duration, future) wraps any future and adds a deadline. If the future does not complete within the duration, it is cancelled and the wrapper returns Err(tokio::time::error::Elapsed).
#![allow(unused)] fn main() { use tokio::time::{timeout, Duration}; async fn fetch_frame_with_deadline(station: &str) -> anyhow::Result<Vec<u8>> { timeout(Duration::from_secs(5), fetch_frame(station)) .await // Elapsed is returned as Err — map it to an application error. .map_err(|_| anyhow::anyhow!("ground station {station} timed out after 5s"))? } }
A critical detail: timeout cancels the inner future when the deadline fires — with the same cooperative semantics as .abort(). The future is dropped at its next .await point after the deadline. If the future holds a database transaction or has submitted writes that should be rolled back on timeout, the transaction handle's Drop must handle the rollback.
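The rollback-on-drop guard pattern looks like the following sketch. `Txn`, its methods, and the counter are invented stand-ins, not a real database API — crates like sqlx implement the same behavior for their transaction types:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Rollback counter so the guard's behavior is observable in this sketch.
static ROLLBACKS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical transaction guard — not a real database API.
struct Txn {
    committed: bool,
}

impl Txn {
    fn begin() -> Self {
        Txn { committed: false }
    }
    fn commit(mut self) {
        // A real implementation would send COMMIT here; Drop then sees
        // committed == true and does nothing.
        self.committed = true;
    }
}

impl Drop for Txn {
    fn drop(&mut self) {
        if !self.committed {
            // Runs whether the holder returned early, panicked, or was
            // cancelled by timeout/abort while the guard was alive.
            ROLLBACKS.fetch_add(1, Ordering::Relaxed); // stand-in for ROLLBACK
        }
    }
}

fn main() {
    {
        let t = Txn::begin();
        t.commit(); // committed — Drop performs no rollback
    }
    {
        let _t = Txn::begin(); // dropped uncommitted — rollback fires
    }
    assert_eq!(ROLLBACKS.load(Ordering::Relaxed), 1);
}
```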
For scenarios where you want to retry on timeout, wrap the timeout in a loop. For scenarios where you want to give a task one deadline with no retry, timeout is the right primitive. For scenarios where you want to cancel based on an external signal (graceful shutdown, satellite pass end), use CancellationToken or tokio::select! with a shutdown receiver.
tokio::select! for Racing Futures
tokio::select! polls multiple futures concurrently and completes with the first one that becomes ready, cancelling the others. It is the right tool for:
- Racing a task against a timeout
- Racing a task against a shutdown signal
- Implementing priority receive patterns on multiple channels
#![allow(unused)] fn main() { use tokio::sync::oneshot; async fn session_with_shutdown( session: impl std::future::Future<Output = ()>, mut shutdown: oneshot::Receiver<()>, ) { tokio::select! { _ = session => { tracing::info!("session completed normally"); } _ = &mut shutdown => { // Shutdown signal received — session future is cancelled here. // RAII cleanup in the session's Drop runs. tracing::info!("session cancelled: shutdown signal received"); } } } }
The branch that wins is executed; the branches that lose are cancelled (futures dropped at their next await point). If you need to do async cleanup when the losing branch is cancelled, you cannot do it inside select! — you need CancellationToken combined with a cleanup task.
Important: all branches of a select! run concurrently on the same task. They are never truly simultaneous — only one executes at a time — but they are polled in interleaved fashion within a single scheduler slot. This is distinct from tokio::spawn, which creates a new task that can run on a different worker thread. select! is lightweight concurrent multiplexing; spawn is independent parallel scheduling.
select! Loop Patterns and Branch Preconditions
select! is most often used inside a loop. Two patterns come up constantly in production systems.
Multi-channel drain with else: when a session task needs to drain from multiple upstream channels until all are closed:
#![allow(unused)] fn main() { use tokio::sync::mpsc; async fn drain_uplinks( mut primary: mpsc::Receiver<Vec<u8>>, mut redundant: mpsc::Receiver<Vec<u8>>, ) { loop { tokio::select! { // select! randomly picks which ready branch to check first — // this prevents the redundant channel from always being starved // if the primary is consistently busy. Some(frame) = primary.recv() => { process_frame(frame, "primary"); } Some(frame) = redundant.recv() => { process_frame(frame, "redundant"); } // else fires when ALL patterns fail — both channels returned None, // meaning both are closed. This is the clean exit condition. else => { tracing::info!("all uplink channels closed — drain complete"); break; } } } } fn process_frame(frame: Vec<u8>, source: &str) { tracing::debug!(bytes = frame.len(), source, "frame processed"); } }
The else branch is not optional when you pattern-match on Some(...). If both channels close and there is no else, select! will panic because no branch can make progress. Always include else when all branches use fallible patterns.
Branch preconditions: the , if condition syntax disables a branch before select! evaluates it. This is essential when polling a pinned future by reference inside a loop — once the future completes, the branch must be disabled or the next iteration will attempt to poll an already-resolved future, causing a panic:
use tokio::sync::mpsc; use tokio::time::{sleep, Duration}; async fn catalog_refresh() -> Vec<u8> { sleep(Duration::from_millis(100)).await; vec![0u8; 128] } #[tokio::main] async fn main() { let (tx, mut cmd_rx) = mpsc::channel::<String>(8); // No commands in this demo: drop the sender so recv() yields None and // that branch is disabled — otherwise the loop below would block forever // waiting on a channel that never closes. drop(tx); let refresh = catalog_refresh(); tokio::pin!(refresh); let mut refresh_done = false; for _ in 0..5 { tokio::select! { // Branch is disabled once refresh_done = true. // Without this precondition: panic on second iteration. result = &mut refresh, if !refresh_done => { println!("catalog refreshed: {} bytes", result.len()); refresh_done = true; } Some(cmd) = cmd_rx.recv() => { println!("command: {cmd}"); } else => break, } } }
When the precondition is false, select! simply skips that branch. If all branches are disabled by preconditions, select! panics — so structure your logic to ensure at least one branch is always eligible or an else handles the case.
Graceful Shutdown Pattern
A production service needs a defined shutdown sequence. For the Meridian control plane:
- Stop accepting new connections.
- Signal active session tasks to finish or cancel.
- Wait for tasks to drain (with a deadline — do not wait forever).
- Flush pending telemetry to downstream consumers.
- Exit cleanly.
#![allow(unused)] fn main() { use std::time::Duration; use tokio::sync::broadcast; struct ShutdownCoordinator { sender: broadcast::Sender<()>, } impl ShutdownCoordinator { fn new() -> Self { let (sender, _) = broadcast::channel(1); Self { sender } } fn subscribe(&self) -> broadcast::Receiver<()> { self.sender.subscribe() } async fn shutdown(&self, tasks: Vec<tokio::task::JoinHandle<()>>) { // Signal all subscribers. let _ = self.sender.send(()); // Give tasks 10 seconds to drain. After that, abort stragglers. let deadline = Duration::from_secs(10); let _ = tokio::time::timeout(deadline, async { for handle in tasks { // Ignore individual task errors during shutdown. let _ = handle.await; } }) .await; } } }
The coordinator sends a shutdown signal over a broadcast channel. Each session task holds a Receiver and uses tokio::select! to race its work against the shutdown signal. After broadcasting, shutdown awaits all handles with a 10-second deadline. Any task that has not completed by then is left to the OS — in a containerized environment, the container will be killed by the orchestrator anyway.
Code Examples
Managing a Satellite Pass Session with Full Lifecycle Control
A pass session has a well-defined lifetime: it starts when the satellite rises above the ground station horizon and ends when it sets. The session task must complete cleanly if the pass ends normally, abort gracefully on shutdown, and timeout if the satellite goes silent mid-pass (antenna tracking failure, power anomaly).
use anyhow::Result; use tokio::{ sync::oneshot, time::{timeout, Duration}, }; use tracing::{info, warn}; #[derive(Debug)] struct PassSession { satellite_id: u32, ground_station: String, } impl Drop for PassSession { fn drop(&mut self) { // Synchronous state flush — no async. // In production, push final state to a lock-free ring buffer // that a background writer drains to persistent storage. info!( satellite_id = self.satellite_id, ground_station = %self.ground_station, "PassSession dropped — flushing state synchronously" ); } } impl PassSession { async fn run(&mut self) -> Result<()> { info!( satellite_id = self.satellite_id, "pass session started" ); // Simulate frame processing loop. // In production: read frames from TcpStream, validate, forward. for frame_num in 0u32..100 { tokio::time::sleep(Duration::from_millis(50)).await; info!(frame = frame_num, "frame processed"); } Ok(()) } } async fn manage_pass( satellite_id: u32, ground_station: String, pass_duration: Duration, mut shutdown_rx: oneshot::Receiver<()>, ) -> Result<()> { let mut session = PassSession { satellite_id, ground_station, }; // Race: session completion, pass duration timeout, or shutdown signal. tokio::select! { result = timeout(pass_duration, session.run()) => { match result { Ok(Ok(())) => info!(satellite_id, "pass completed normally"), Ok(Err(e)) => warn!(satellite_id, "session error: {e}"), Err(_) => warn!(satellite_id, "pass duration exceeded — session timed out"), } } _ = &mut shutdown_rx => { // PassSession::drop runs here, flushing state before the task exits. warn!(satellite_id, "pass cancelled: shutdown received"); } } Ok(()) } #[tokio::main] async fn main() -> Result<()> { tracing_subscriber::fmt::init(); let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>(); let handle = tokio::spawn(manage_pass( 25544, "gs-svalbard".to_string(), Duration::from_secs(30), shutdown_rx, )); // Simulate shutdown signal after 1 second. 
tokio::time::sleep(Duration::from_secs(1)).await; let _ = shutdown_tx.send(()); match handle.await { Ok(Ok(())) => info!("task completed"), Ok(Err(e)) => warn!("task error: {e}"), Err(e) if e.is_cancelled() => warn!("task was aborted externally"), Err(e) => warn!("task panicked: {e}"), } Ok(()) }
Key decisions in this code: the Drop impl handles synchronous cleanup, which is guaranteed to run whether the session completes normally, times out, or is cancelled. The select! gives the session three possible exit paths with distinct log entries — observable, diagnosable behavior rather than silent state corruption. The outer .await on the handle distinguishes between clean task exit, application errors, external abort, and panics.
Key Takeaways
- JoinHandle<T> awaits as Result<T, JoinError>. Distinguish between panics and cancellation using e.is_panic() / e.is_cancelled(). Never .unwrap() a JoinHandle in production code without a comment explaining the invariant.
- Dropping a JoinHandle does not cancel the task. Call .abort() explicitly if you need cancellation on drop. .abort() is cooperative — the task stops at its next .await point, not immediately.
- Async cleanup after an .await is not guaranteed on cancellation. Put mandatory cleanup in Drop (synchronous) or use CancellationToken to intercept the shutdown signal and run async teardown before the task exits.
- tokio::time::timeout wraps any future with a deadline. On expiry, it cancels the inner future at its next .await. Resources held by the cancelled future are dropped via RAII — no manual cleanup needed if your types implement Drop correctly.
- tokio::select! runs all branches on the same task — they multiplex, they do not parallelize. Branches randomly compete for selection when multiple are ready, which prevents starvation. Use tokio::spawn when you need true independent scheduling; use select! when you need lightweight concurrency within a single task.
- select! branch preconditions (, if condition) disable a branch before evaluation. Always use them with pinned futures in loops to prevent the "async fn resumed after completion" panic.
- In select! loops, always include an else branch when all active branches use fallible patterns like Some(...). The else branch fires when all patterns fail to match — typically when all channels are closed — and provides the clean exit condition.
- CancellationToken (from tokio-util) is the preferred cancellation primitive for cooperative shutdown. Cloning shares the same cancellation event. .cancelled().await composes naturally with select! and, unlike .abort(), allows the task to run async cleanup before exiting.
- TaskTracker (from tokio-util) is the preferred drain primitive for shutdown. Spawn tasks through the tracker, call .close() when done spawning, then .wait().await to block until all tasks finish. This avoids the JoinHandle Vec pattern and correctly handles the close/wait ordering requirement.
Project — Async Telemetry Ingestion Broker
Module: Foundation — M01: Async Rust Fundamentals
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Frame Format Reference
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0041 — Telemetry Ingestion Broker Replacement
The legacy Python telemetry broker is being decommissioned. It accepted connections sequentially on a single thread and could not keep up beyond 12 concurrent ground station feeds. With constellation expansion to 48 LEO satellites and 12 active ground station sites, the broker routinely falls behind during peak pass windows, buffering up to 40 seconds of lag before flushing — unacceptable for conjunction avoidance workflows that require sub-10-second delivery.
Your task is to implement the replacement broker in Rust using Tokio. The broker must accept concurrent TCP connections from ground stations, parse incoming telemetry frames, and fan each frame out to multiple registered downstream handlers — without blocking on any single slow handler.
The broker does not perform conjunction computation. It is a pure ingress and distribution layer. Correctness, throughput, and clean lifecycle management are the acceptance criteria.
System Specification
Connection Model
- Ground stations connect over TCP to a configurable bind address.
- Each connection streams telemetry frames encoded as length-prefixed byte sequences: a 4-byte big-endian u32 length header followed by length bytes of payload.
- Connections are persistent for the duration of a satellite pass (8–12 minutes). They may drop and reconnect within a pass without notice.
- The broker must handle up to 48 concurrent connections without degradation.
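The length-prefix encoding can be exercised without a socket. A sketch of pure-std encode/decode helpers — the function names are ours, not part of the specification:

```rust
/// Encode a payload as a length-prefixed frame: 4-byte big-endian u32, then bytes.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decode one frame from the front of `buf`, returning (payload, bytes_consumed),
/// or None if the buffer does not yet hold a complete frame.
fn decode_frame(buf: &[u8]) -> Option<(&[u8], usize)> {
    let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let len = u32::from_be_bytes(len_bytes) as usize;
    let payload = buf.get(4..4 + len)?;
    Some((payload, 4 + len))
}

fn main() {
    let frame = encode_frame(b"hello");
    assert_eq!(frame[..4], [0u8, 0, 0, 5]); // big-endian length prefix
    let (payload, consumed) = decode_frame(&frame).unwrap();
    assert_eq!(payload, b"hello");
    assert_eq!(consumed, 9); // 4-byte header + 5-byte payload
}
```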
Frame Routing
- Registered downstream handlers receive every frame via a bounded tokio::sync::broadcast channel.
- If a slow handler's receiver falls behind and the broadcast channel fills, it is the handler's problem — the broker must not block or slow its ingress path to accommodate a slow consumer.
- The broker logs a warning when a receiver falls behind (broadcast returns RecvError::Lagged).
Lifecycle
- The broker accepts a shutdown signal (a tokio::sync::watch or oneshot channel) and performs graceful shutdown:
  - Stop accepting new connections.
  - Signal all active session tasks to drain and exit.
  - Wait up to 10 seconds for tasks to finish.
  - Force-abort any remaining tasks and exit.
- Session tasks must flush their in-progress frame before shutting down (complete the current frame read, then exit — do not abort mid-frame).
Expected Output
A binary crate (meridian-broker) that:
- Binds a TCP listener on a configurable address (default 0.0.0.0:7777).
- Spawns a new async task per incoming connection.
- Each task reads frames using the length-prefix protocol.
- Each parsed frame is sent over a broadcast::Sender<Frame>.
- A configurable number of simulated downstream handler tasks subscribe to the broadcast channel and print/log received frames.
- Ctrl-C triggers graceful shutdown with the sequence described above.
The binary should run, accept at least one connection from telnet or netcat with hand-crafted bytes, and log frame receipt and shutdown cleanly.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Broker accepts ≥2 simultaneous TCP connections without either blocking the other | Yes — connect two nc sessions concurrently |
| 2 | Frames are delivered to all registered downstream handlers | Yes — log output shows frame receipt on each handler |
| 3 | A slow downstream handler does not stall frame ingestion on other connections | Yes — add a tokio::time::sleep in one handler; other connections continue at full rate |
| 4 | Ctrl-C triggers graceful shutdown; in-progress frame reads complete before the task exits | Yes — observable in log output |
| 5 | If shutdown drain exceeds 10 seconds, remaining tasks are aborted | Yes — simulate a stuck task and verify the process exits within 11 seconds |
| 6 | No .unwrap() on JoinHandle::await or channel send/receive in production paths | Yes — code review |
| 7 | spawn_blocking is used for any synchronous I/O or CPU-intensive frame processing | Yes — code review |
Frame Format Reference
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Payload Length (u32 BE) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Payload (variable length) |
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
A Frame struct for the purpose of this project:
#![allow(unused)] fn main() { #[derive(Clone, Debug)] pub struct Frame { pub station_id: String, pub payload: Vec<u8>, } }
Hints
Hint 1 — Reading length-prefixed frames
tokio::io::AsyncReadExt provides .read_exact(&mut buf) which reads exactly buf.len() bytes or returns an error. Use it to read the 4-byte header, parse the length, allocate the payload buffer, and read the payload:
#![allow(unused)] fn main() { use tokio::io::AsyncReadExt; use tokio::net::TcpStream; async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Vec<u8>> { let mut len_buf = [0u8; 4]; stream.read_exact(&mut len_buf).await?; let len = u32::from_be_bytes(len_buf) as usize; let mut payload = vec![0u8; len]; stream.read_exact(&mut payload).await?; Ok(payload) } }
Hint 2 — Broadcast channel for fan-out
tokio::sync::broadcast::channel(capacity) returns (Sender<T>, Receiver<T>). Additional receivers are created with sender.subscribe(). Receivers that fall behind by more than capacity messages receive Err(RecvError::Lagged(n)) — not an error in the fatal sense, just a signal that they missed n messages. Log it and continue receiving.
#![allow(unused)] fn main() { use tokio::sync::broadcast; let (tx, _rx) = broadcast::channel::<Frame>(256); // Downstream handler let mut rx = tx.subscribe(); tokio::spawn(async move { loop { match rx.recv().await { Ok(frame) => { /* process */ } Err(broadcast::error::RecvError::Lagged(n)) => { tracing::warn!(missed = n, "handler fell behind"); } Err(broadcast::error::RecvError::Closed) => break, } } }); }
Hint 3 — Graceful shutdown with watch channel
tokio::sync::watch is well-suited for broadcasting a shutdown signal to an arbitrary number of tasks: one sender, many receivers, each receiver can check the current value or wait for a change.
#![allow(unused)] fn main() { use tokio::sync::watch; let (shutdown_tx, shutdown_rx) = watch::channel(false); // In each session task: let mut shutdown = shutdown_rx.clone(); tokio::select! { result = read_frames(&mut stream) => { /* ... */ } _ = shutdown.changed() => { tracing::info!("shutdown received, finishing current frame"); // complete current frame read if mid-frame, then return } } // In shutdown handler: let _ = shutdown_tx.send(true); }
Hint 4 — Collecting JoinHandles for drain
Keep a Vec<JoinHandle<()>> of spawned session tasks. During shutdown, wrap the drain loop in tokio::time::timeout:
#![allow(unused)] fn main() { let drain_deadline = Duration::from_secs(10); let drain_result = tokio::time::timeout(drain_deadline, async { for handle in session_handles { let _ = handle.await; // ignore individual task errors } }).await; if drain_result.is_err() { tracing::warn!("drain deadline exceeded — some tasks may not have flushed"); } }
After the timeout, the remaining JoinHandles are dropped (tasks continue) or you can collect and abort them explicitly.
Reference Implementation
Reveal reference implementation (attempt the project first)
// src/main.rs use anyhow::Result; use std::sync::Arc; use tokio::{ net::{TcpListener, TcpStream}, sync::{broadcast, watch}, time::{timeout, Duration}, }; use tokio::io::AsyncReadExt; use tracing::{info, warn}; #[derive(Clone, Debug)] pub struct Frame { pub station_id: String, pub payload: Vec<u8>, } async fn read_frame(stream: &mut TcpStream) -> Result<Vec<u8>> { let mut len_buf = [0u8; 4]; stream.read_exact(&mut len_buf).await?; let len = u32::from_be_bytes(len_buf) as usize; if len > 65_536 { // Reject oversized frames — likely a protocol error or malicious client. anyhow::bail!("frame length {len} exceeds maximum 65536 bytes"); } let mut payload = vec![0u8; len]; stream.read_exact(&mut payload).await?; Ok(payload) } async fn handle_connection( mut stream: TcpStream, station_id: String, frame_tx: broadcast::Sender<Frame>, mut shutdown_rx: watch::Receiver<bool>, ) { info!(station = %station_id, "connection established"); loop { tokio::select! { // Bias toward reading frames: any frame that is already ready is // processed before the shutdown branch is even considered. biased; result = read_frame(&mut stream) => { match result { Ok(payload) => { let frame = Frame { station_id: station_id.clone(), payload, }; // Broadcast errors mean all receivers dropped — broker is shutting down. if frame_tx.send(frame).is_err() { break; } } Err(e) => { // EOF or read error — connection dropped. info!(station = %station_id, "connection closed: {e}"); break; } } } _ = shutdown_rx.changed() => { if *shutdown_rx.borrow() { info!(station = %station_id, "shutdown signal received — exiting"); // Caveat: if this branch wins while a frame read is mid-flight, // that read_frame future is dropped with the frame incomplete. // A fully lossless drain would pin the read future and drive it // to completion before exiting. 
break; } } } } info!(station = %station_id, "connection handler exiting"); } fn spawn_handler( id: usize, mut rx: broadcast::Receiver<Frame>, ) -> tokio::task::JoinHandle<()> { tokio::spawn(async move { loop { match rx.recv().await { Ok(frame) => { info!( handler = id, station = %frame.station_id, bytes = frame.payload.len(), "frame received" ); } Err(broadcast::error::RecvError::Lagged(n)) => { warn!(handler = id, missed = n, "handler fell behind — lagged"); } Err(broadcast::error::RecvError::Closed) => { info!(handler = id, "broadcast channel closed, handler exiting"); break; } } } }) } #[tokio::main] async fn main() -> Result<()> { tracing_subscriber::fmt::init(); let bind_addr = "0.0.0.0:7777"; let listener = TcpListener::bind(bind_addr).await?; info!("meridian broker listening on {bind_addr}"); let (frame_tx, _) = broadcast::channel::<Frame>(256); let (shutdown_tx, shutdown_rx) = watch::channel(false); // Spawn 3 downstream handlers. let mut handler_handles: Vec<tokio::task::JoinHandle<()>> = (0..3) .map(|i| spawn_handler(i, frame_tx.subscribe())) .collect(); // Ctrl-C handler. let shutdown_tx = Arc::new(shutdown_tx); let shutdown_tx_ctrlc = shutdown_tx.clone(); tokio::spawn(async move { tokio::signal::ctrl_c().await.expect("failed to listen for ctrl-c"); info!("ctrl-c received — initiating graceful shutdown"); let _ = shutdown_tx_ctrlc.send(true); }); let mut session_handles: Vec<tokio::task::JoinHandle<()>> = Vec::new(); let mut conn_id = 0usize; loop { // Stop accepting new connections once shutdown is signalled. if *shutdown_rx.borrow() { break; } tokio::select! 
{ accept = listener.accept() => { match accept { Ok((stream, addr)) => { conn_id += 1; let station_id = format!("gs-{conn_id}@{addr}"); let handle = tokio::spawn(handle_connection( stream, station_id, frame_tx.clone(), shutdown_rx.clone(), )); session_handles.push(handle); } Err(e) => warn!("accept error: {e}"), } } _ = shutdown_rx.changed() => { if *shutdown_rx.borrow() { break; } } } } info!("draining {} active sessions (10s deadline)", session_handles.len()); // Drop the broadcast sender so downstream handlers see Closed after drain. drop(frame_tx); let drain_result = timeout(Duration::from_secs(10), async { for handle in session_handles { let _ = handle.await; } for handle in handler_handles.drain(..) { let _ = handle.await; } }) .await; if drain_result.is_err() { warn!("drain deadline exceeded — forcing exit"); } else { info!("all tasks drained cleanly"); } Ok(()) }
Cargo.toml dependencies:
[dependencies]
tokio = { version = "1", features = ["full"] }
anyhow = "1"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
Testing the broker manually:
# Terminal 1: run the broker
RUST_LOG=info cargo run
# Terminal 2: send a frame (4-byte length prefix = 5, then "hello")
printf '\x00\x00\x00\x05hello' | nc localhost 7777
# Terminal 3: send concurrently
printf '\x00\x00\x00\x08meridian' | nc localhost 7777
# Ctrl-C in Terminal 1 to trigger graceful shutdown
Reflection
After completing this project, you have built the entry point for Meridian's control plane ingress. The patterns used here — broadcast fan-out, select!-driven shutdown, bounded drain with timeout, JoinHandle collection — recur throughout the rest of the Foundation modules and into the Data Pipelines track.
Consider for further exploration: what happens if the broker receives 10,000 connections? At what point does the spawn-per-connection model become a problem, and what is the alternative? How would you add backpressure from downstream handlers back to the ingress path without stalling the broker? These questions are the starting point for Module 3 (Message Passing Patterns).
Module 02 — Concurrency Primitives
Track: Foundation — Mission Control Platform
Position: Module 2 of 6
Source material: Rust Atomics and Locks — Mara Bos, Chapters 1–3
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Ground Station Command Queue
- Prerequisites
- What Comes Next
Mission Context
The Meridian control plane is not a purely async system. The async runtime handles the high-frequency I/O path. But the control plane also runs CPU-bound conjunction checks, synchronous vendor libraries with C FFI, a shared priority command table written by multiple connections and read by the session dispatcher, and lock-free statistics counters sampled by the monitoring dashboard.
The Python system handled shared state with a global threading lock. In six months of operation, that lock has caused four production incidents. This module establishes the Rust concurrency model that replaces it — not by eliminating shared state, but by giving you the type-level guarantees and primitive toolkit to reason about it precisely.
What You Will Learn
By the end of this module you will be able to:
- Distinguish OS threads from async tasks at the scheduling level, and route work to the correct model for its characteristics — blocking work to spawn_blocking or scoped threads, I/O-bound work to async tasks
- Use Send and Sync to reason about which types can cross thread boundaries, and understand why Rc, Cell, and raw pointers opt out
- Implement shared mutable state with Mutex<T> and RwLock<T>, manage MutexGuard lifetimes correctly, and handle lock poisoning appropriately
- Identify the three deadlock patterns that cause most production incidents and apply the structural patterns that prevent them
- Use tokio::sync::Mutex when locks must be held across .await points in async code
- Apply atomic operations (fetch_add, compare_exchange, load/store) and select the correct memory ordering (Relaxed, Acquire/Release, SeqCst) for the intended guarantee
Lessons
Lesson 1 — Threads vs Async Tasks: When to Use Each and Why
Covers std::thread::spawn vs tokio::spawn, the preemptive vs cooperative scheduling distinction, thread::scope for scoped threads with borrowed data, Send and Sync as the compiler's enforcement mechanism, and spawn_blocking as the bridge between the two models.
Key question this lesson answers: When a piece of work needs to happen concurrently, how do you decide between an OS thread and an async task — and what goes wrong if you choose incorrectly?
→ lesson-01-threads-vs-async.md / lesson-01-quiz.toml
Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks
Covers Mutex<T> mechanics (locking, MutexGuard, RAII unlock, lock poisoning), RwLock<T> and the read-heavy access pattern, MutexGuard lifetime pitfalls in async code, tokio::sync::Mutex, Condvar for blocking on data conditions, and the three deadlock patterns with structural prevention strategies.
Key question this lesson answers: How do you share mutable data between threads correctly, and what are the failure modes that Rust's type system does not prevent?
→ lesson-02-shared-state.md / lesson-02-quiz.toml
Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice
Covers atomic types and the operations they support (load/store, fetch_add/fetch_sub, compare_exchange), memory ordering (Relaxed, Acquire/Release, AcqRel, SeqCst), the happens-before relationship established by the Acquire/Release pair, and the practical decision of when to use atomics vs a mutex.
Key question this lesson answers: When is a mutex overkill, and how do you safely share single values between threads without locking — including ensuring the processor and compiler do not reorder the operations that matter?
→ lesson-03-atomics.md / lesson-03-quiz.toml
Capstone Project — Ground Station Command Queue
Build a typed, concurrent priority command queue for Meridian's mission operations system. The queue accepts commands from multiple concurrent ground station producer threads, dispatches them to a consumer in priority order with FIFO tie-breaking, blocks producers when at capacity (without dropping commands), exposes lock-free metrics readable without acquiring the queue lock, and shuts down gracefully by draining remaining commands.
Acceptance is against 7 verifiable criteria including correct priority dispatch, non-busy-waiting, lock-free metrics access, and clean shutdown drain.
→ project-command-queue.md
Prerequisites
Module 1 (Async Rust Fundamentals) must be complete. This module assumes you understand how async tasks are scheduled and why blocking an async worker thread is harmful — that understanding is the foundation for the threads-vs-async distinction in Lesson 1 and the async Mutex guidance in Lesson 2.
What Comes Next
Module 3 — Message Passing Patterns builds the next layer: rather than sharing state between concurrent actors, you pass ownership of data through channels. The command queue from this module's project is extended in Module 3 with a tokio::sync::mpsc front-end, moving backpressure into async channel semantics.
Lesson 1 — Threads vs Async Tasks: When to Use Each and Why
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 1 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1
Context
The Meridian control plane is not a purely async system. The async runtime handles the high-frequency path — accepting ground station connections, reading telemetry frames, routing them to downstream consumers. But the control plane also runs work that has no business on an async worker thread: a vendor-supplied TLE validation library with a synchronous C FFI, a CPU-intensive conjunction check that processes several hundred orbital elements per pass, and a legacy configuration parser that performs synchronous file I/O.
The Python system handled this by running everything on threads, leaning on the GIL to serialize concurrent access. The Rust replacement needs a deliberate model. The first decision you make when writing any new piece of the control plane is: does this go on an async task or an OS thread? Getting this wrong produces either a system that starves its async executor with blocking work, or one that spawns OS threads unnecessarily, paying per-thread stack overhead at scale.
This lesson establishes the model. Every rule here has a corresponding failure mode that has been observed in Meridian's staging environment.
Source: Rust Atomics and Locks, Chapter 1 (Bos)
Core Concepts
The Fundamental Difference
An OS thread is scheduled by the kernel. The kernel decides when it runs, when it is preempted, and which CPU core it runs on. The thread has its own stack (typically 2–8 MB by default), and blocking — whether on I/O, a mutex, or std::thread::sleep — is perfectly safe: the kernel parks the thread and runs something else.
An async task is scheduled by the executor. It runs until it voluntarily yields at an await point. It shares executor worker threads with other tasks. Blocking on the worker thread — calling a synchronous library, running a long computation, sleeping with std::thread::sleep — starves every other task scheduled on that thread. There is no kernel to preempt you and run something else.
This is the core rule: any call that can block a thread for non-trivial time belongs on an OS thread, not on an async worker thread. In Tokio, the mechanism is spawn_blocking, which routes the closure to a dedicated blocking thread pool. From the async side, it looks like an awaitable future. On the execution side, it gets a real OS thread.
std::thread::spawn — Ownership and Lifetimes
std::thread::spawn takes a closure that is Send + 'static. The 'static requirement means the thread cannot borrow from the spawning scope — it must own everything it uses, or access data through shared references that are themselves 'static (like Arc).
```rust
use std::thread;
use std::sync::Arc;

fn main() {
    let catalog = Arc::new(vec!["ISS", "CSS", "STARLINK-1"]);

    let handle = thread::spawn({
        let catalog = Arc::clone(&catalog);
        move || {
            // catalog is owned by this thread — no borrow, no lifetime issue.
            println!("Thread sees {} objects", catalog.len());
        }
    });

    handle.join().unwrap();
    println!("Main sees {} objects", catalog.len());
}
```
The Arc::clone before the move is idiomatic: clone the handle, not the data. The thread gets its own Arc pointer (cheap — one atomic increment), and both threads share the underlying Vec. When both Arcs drop, the Vec deallocates.
thread::scope — Scoped Threads with Borrowed Data
The 'static requirement on spawn prevents borrowing stack data. thread::scope lifts this restriction: threads spawned within a scope are guaranteed to finish before the scope exits, which allows them to borrow data from the enclosing frame.
```rust
use std::thread;

fn validate_tle_batch(records: &[String]) -> usize {
    let mid = records.len() / 2;
    let (left, right) = records.split_at(mid);

    // Scoped threads can borrow `left` and `right` — no Arc, no clone.
    thread::scope(|s| {
        let left_handle = s.spawn(|| left.iter().filter(|r| r.starts_with("1 ")).count());
        let right_handle = s.spawn(|| right.iter().filter(|r| r.starts_with("1 ")).count());
        // scope blocks here until both threads finish.
        left_handle.join().unwrap() + right_handle.join().unwrap()
    })
}

fn main() {
    let records: Vec<String> = (0..100)
        .map(|i| format!("{} {:05}U record", if i % 2 == 0 { "1" } else { "2" }, i))
        .collect();
    println!("{} valid TLE lines", validate_tle_batch(&records));
}
```
thread::scope is the right tool for data-parallel CPU work over a borrowed slice — exactly the conjunction check pattern in the Meridian pipeline. No heap allocation, no Arc, no 'static constraint. The compiler enforces that the borrowed data outlives the scope.
Send and Sync — The Type System's Enforcement
Rust enforces thread safety through two marker traits (Rust Atomics and Locks, Ch. 1):
Send: a type is Send if ownership of a value of that type can be transferred to another thread. Arc<T> is Send (if T: Send + Sync), but Rc<T> is not — Rc's reference count is non-atomic and would race if shared across threads.
Sync: a type is Sync if it can be shared between threads by shared reference. i32 is Sync. Cell<i32> is not — mutating through a shared reference is not safe across threads.
The compiler enforces these automatically. You cannot accidentally send an Rc<T> to another thread — thread::spawn requires a Send closure, and Rc does not implement Send. You cannot share a RefCell<T> across threads by shared reference — that would require RefCell to be Sync, which it is not.
Both traits are auto traits, implemented automatically: a struct whose fields are all Send is itself Send. The common exceptions are raw pointers (*const T, *mut T), Rc, Cell, RefCell, and types that wrap OS handles that are not thread-safe. When you implement a type that wraps these, you must opt in to Send/Sync manually with unsafe impl, accepting responsibility for upholding the invariant yourself.
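A minimal sketch of the Send boundary in action. The helper sum_on_thread is hypothetical (not part of the Meridian codebase); it exists only to show that an Arc crosses the thread boundary where an Rc cannot:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Arc<Vec<u32>> is Send (its contents are Send + Sync), so the closure
// that owns it satisfies thread::spawn's Send bound.
fn sum_on_thread(shared: Arc<Vec<u32>>) -> u32 {
    let handle = thread::spawn(move || shared.iter().sum::<u32>());
    handle.join().unwrap()
}

fn main() {
    assert_eq!(sum_on_thread(Arc::new(vec![1, 2, 3])), 6);

    // Rc<T> is not Send: its reference count is non-atomic. Uncommenting
    // the spawn below is a compile error, not a runtime race.
    let local = Rc::new(vec![1, 2, 3]);
    // thread::spawn(move || local.len()); // error[E0277]: `Rc<Vec<u32>>` cannot be sent between threads safely
    println!("main-thread Rc sees {} elements", local.len());
}
```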
Choosing the Right Model
The decision tree for any piece of work in the control plane:
| Work type | Right model | Mechanism |
|---|---|---|
| Concurrent TCP connections, channel receive/send | Async task | tokio::spawn |
| CPU-bound computation (conjunction check, CRC) | Blocking thread | spawn_blocking |
| Synchronous vendor library (C FFI) | Blocking thread | spawn_blocking |
| Synchronous file I/O (std::fs) | Blocking thread | spawn_blocking |
| Data-parallel work over borrowed data | Scoped threads | thread::scope |
| Independent long-running background service | OS thread | thread::spawn |
The cost difference matters at scale. An OS thread on Linux has a default 8 MB stack reservation (even if physical pages are not committed until used), a kernel thread structure, and scheduling overhead. Tokio tasks use a few hundred bytes of heap. The control plane at 48 uplinks can sustain thousands of concurrent tasks trivially; it cannot sustain thousands of OS threads without careful stack-size tuning.
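Where OS threads are unavoidable at scale, the per-thread reservation can be set explicitly. A small sketch using std::thread::Builder — the thread name and the 256 KiB figure are illustrative choices, not Meridian configuration:

```rust
use std::thread;

// Spawn a worker with an explicit 256 KiB stack instead of the platform
// default (commonly an 8 MB reservation on Linux).
fn spawn_small_stack_worker() -> u64 {
    thread::Builder::new()
        .name("meridian-worker".into()) // name is illustrative
        .stack_size(256 * 1024)         // per-thread stack reservation in bytes
        .spawn(|| (0..1000u64).sum())   // shallow, small-footprint work
        .expect("failed to spawn thread")
        .join()
        .unwrap()
}

fn main() {
    assert_eq!(spawn_small_stack_worker(), 499_500);
    println!("worker finished on a 256 KiB stack");
}
```

Sizing the stack down only makes sense for work with known, shallow call depth; deep recursion on a small stack overflows.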
Code Examples
Mixing Async and Blocking: The Vendor TLE Validator
The TLE validation library provided by Meridian's orbit data vendor is a synchronous C library wrapped in a Rust FFI crate. It performs checksum validation and orbital element range checking — purely CPU work, no I/O, but it takes 2–15ms per record depending on complexity. Calling it from an async task would stall the executor for the duration.
```rust
use std::time::Duration;
use tokio::task;

// Simulates a synchronous vendor library call.
// In production: calls into the C FFI wrapper.
fn validate_tle_sync(line1: &str, line2: &str) -> Result<(), String> {
    // Vendor library does checksum + orbital element bounds checking.
    // Blocks for 2–15ms depending on record complexity.
    std::thread::sleep(Duration::from_millis(5)); // placeholder
    if line1.starts_with("1 ") && line2.starts_with("2 ") {
        Ok(())
    } else {
        Err(format!("malformed TLE: {line1}"))
    }
}

async fn validate_tle_async(line1: String, line2: String) -> Result<(), String> {
    // Move strings into the blocking closure.
    // spawn_blocking runs on the dedicated blocking thread pool —
    // async worker threads are not touched.
    task::spawn_blocking(move || validate_tle_sync(&line1, &line2))
        .await
        // JoinError means the blocking thread panicked.
        .map_err(|e| format!("validator panicked: {e}"))?
}

#[tokio::main]
async fn main() {
    // All 48 sessions can submit validation concurrently.
    // Each runs on the blocking pool; none stall the async workers.
    let tasks: Vec<_> = (0..6)
        .map(|i| {
            tokio::spawn(validate_tle_async(
                format!("1 {:05}U 98067A 21275.52 .00001234 00000-0 12345-4 0 999{i}", i),
                format!("2 {:05} 51.6400 337.6640 0007417 62.6000 297.5200 15.4888958300000{i}", i),
            ))
        })
        .collect();

    for (i, t) in tasks.into_iter().enumerate() {
        match t.await.unwrap() {
            Ok(()) => println!("record {i}: valid"),
            Err(e) => println!("record {i}: {e}"),
        }
    }
}
```
Scoped Threads for Parallel Conjunction Screening
The conjunction screening pass runs every 10 minutes against the full 50k-object catalog. It splits the catalog across CPU cores using scoped threads. The catalog is a large Vec<OrbitalRecord> — no clone, no Arc, just borrowed slices distributed across workers.
```rust
use std::thread;

#[derive(Clone)]
struct OrbitalRecord {
    norad_id: u32,
    altitude_km: f64,
}

struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    closest_approach_km: f64,
}

fn screen_shard(shard: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    // Simplified: real implementation computes relative positions via SGP4.
    shard
        .windows(2)
        .filter(|pair| (pair[0].altitude_km - pair[1].altitude_km).abs() < threshold_km)
        .map(|pair| ConjunctionAlert {
            object_a: pair[0].norad_id,
            object_b: pair[1].norad_id,
            closest_approach_km: (pair[0].altitude_km - pair[1].altitude_km).abs(),
        })
        .collect()
}

fn run_conjunction_screen(catalog: &[OrbitalRecord], threshold_km: f64) -> Vec<ConjunctionAlert> {
    let num_cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let shard_size = (catalog.len() + num_cores - 1) / num_cores;

    thread::scope(|s| {
        let handles: Vec<_> = catalog
            .chunks(shard_size)
            .map(|shard| s.spawn(move || screen_shard(shard, threshold_km)))
            .collect();

        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let catalog: Vec<OrbitalRecord> = (0..1000)
        .map(|i| OrbitalRecord { norad_id: i, altitude_km: 400.0 + (i as f64 * 0.3) })
        .collect();

    let alerts = run_conjunction_screen(&catalog, 5.0);
    println!("{} conjunction alerts generated", alerts.len());
}
```
Each shard runs on its own OS thread via thread::scope, borrowing its slice without any heap allocation for sharing. The scope blocks until all workers finish, then results are collected. This is the correct pattern for data-parallel CPU work where all input data is available upfront and results need to be aggregated.
Key Takeaways
- OS threads are preemptively scheduled by the kernel. Async tasks are cooperatively scheduled by the executor. Blocking on an async worker thread — any call that does not yield at an await — starves other tasks on that thread.
- Use spawn_blocking for any synchronous, blocking, or CPU-intensive work that originates in an async context. It routes work to a dedicated thread pool separate from the async workers.
- thread::scope allows scoped threads to borrow data from the enclosing frame without Arc or 'static constraints. It is the right tool for data-parallel work over borrowed slices. The scope blocks until all spawned threads finish.
- Send and Sync are marker traits enforced at compile time. Send permits transferring ownership across threads; Sync permits sharing by reference. Violating these constraints — sending Rc, sharing Cell — is a compile error, not a runtime race.
- The thread vs async decision is about scheduling model, not concurrency. Both models run work concurrently. The difference is what happens when work blocks: OS threads can block safely; async tasks cannot.
Lesson 2 — Shared State: Mutex, RwLock, and Avoiding Deadlocks
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 2 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapter 1
Context
The Meridian command queue maintains a shared priority table: incoming operator commands are written by the command ingress task, read by the session dispatch task, and occasionally queried by the monitoring dashboard. The Python system used a global dictionary with a threading lock. In production, that lock has been involved in three separate deadlock incidents — two in the same deployment week — all caused by the same root pattern: lock acquired, function called, that function also acquires the same lock.
Rust does not prevent deadlocks at compile time. But it gives you the tools to reason about them precisely: Mutex<T> and RwLock<T> make the protected data visible in the type signature, MutexGuard makes it impossible to access data without holding the lock, and RAII makes it impossible to forget to release it. This lesson covers how these primitives work, the failure modes that remain after Rust's type system has done its job, and the patterns that prevent them.
Source: Rust Atomics and Locks, Chapter 1 (Bos)
Core Concepts
Mutex<T> — Exclusive Access with RAII
std::sync::Mutex<T> wraps a value of type T and enforces that only one thread can access it at a time. The data is inaccessible without locking. There is no way to accidentally read T without going through .lock().
.lock() returns LockResult<MutexGuard<'_, T>>. The MutexGuard dereferences to T and automatically releases the lock when it drops. There is no .unlock() method. The lock is released when the guard goes out of scope — or, critically, when it is explicitly dropped.
```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let command_count = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&command_count);
            thread::spawn(move || {
                for _ in 0..1000 {
                    // Lock is acquired here. Guard is dropped at end of block.
                    let mut count = counter.lock().unwrap();
                    *count += 1;
                    // Guard dropped here — lock released before next iteration.
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("commands processed: {}", command_count.lock().unwrap());
}
```
The Arc provides shared ownership across threads (Rc is not Send and will not compile here). The Mutex provides exclusive access. This is the standard pattern for shared mutable state between threads.
Lock Poisoning
When a thread panics while holding a Mutex lock, the mutex is marked poisoned. Subsequent calls to .lock() return Err(PoisonError). The data is still accessible through the error — err.into_inner() returns the MutexGuard — but the poison signals that the data may be in an inconsistent state.
In practice, most Meridian code uses .unwrap() on mutex locks. This is deliberate: if a thread panics while holding the command queue lock, it is not safe to continue operating on potentially corrupted queue state. Propagating the panic is the correct response. The cases where you would recover from a poisoned mutex are rare and require domain-specific knowledge about what "inconsistent" means for that data.
One place where .unwrap() is wrong: in a test or in a thread that genuinely needs to clean up a partially-written state. In those cases, match on the LockResult explicitly.
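A minimal sketch of that recovery path. The helper poison_then_recover is hypothetical; it deliberately poisons a mutex by panicking while holding the guard, then matches on the LockResult instead of unwrapping:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Poison a mutex on purpose, then recover the data through the PoisonError.
fn poison_then_recover() -> usize {
    let state = Arc::new(Mutex::new(vec![1u32, 2, 3]));

    let s = Arc::clone(&state);
    let _ = thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("panicked while holding the lock"); // poisons the mutex
    })
    .join(); // join returns Err because the thread panicked

    // .lock() now returns Err(PoisonError). into_inner() yields the guard
    // anyway, letting the caller inspect — and if needed, repair — the state.
    let guard = match state.lock() {
        Ok(g) => g,
        Err(poisoned) => poisoned.into_inner(),
    };
    guard.len()
}

fn main() {
    assert_eq!(poison_then_recover(), 3);
    println!("recovered data from a poisoned mutex");
}
```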
MutexGuard Lifetime — A Common Bug
The most common Mutex bug in Rust code is holding a guard longer than intended, or — worse — holding it across an .await point in async code. A guard held across an await keeps the lock held for the entire duration of the async operation. If another task then tries to acquire the same lock, it blocks the async worker thread, because std::sync::Mutex::lock blocks the thread rather than yielding to the executor.
```rust
use std::sync::Mutex;

fn main() {
    let data = Mutex::new(vec![1u32, 2, 3]);

    {
        // Without the explicit drop below, the guard would live to the end
        // of the if block — still alive when we lock again for the push.
        let guard = data.lock().unwrap();
        if guard.contains(&2) {
            drop(guard); // Must explicitly drop before re-locking.
            data.lock().unwrap().push(4);
        }
        // Without the explicit drop, this deadlocks: std::sync::Mutex is not
        // reentrant, so the second lock() waits on the guard we still hold.
    }

    println!("{:?}", data.lock().unwrap());
}
```
In async code, use tokio::sync::Mutex instead of std::sync::Mutex. It yields to the executor while waiting for the lock rather than blocking the thread. Conversely, never hold a tokio::sync::MutexGuard across a .await that might block for a long time — you are holding the lock for the duration of that await, which blocks all other lock waiters.
RwLock<T> — Read Concurrency, Write Exclusivity
RwLock<T> distinguishes between reads and writes. Multiple readers can hold the lock simultaneously; a writer requires exclusive access. This is the concurrent version of RefCell.
It is appropriate when reads are frequent and writes are rare. For the Meridian session state table: many tasks read current session state, but writes only happen when sessions start or end. An RwLock allows those many concurrent reads without serializing them.
```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

type SessionTable = Arc<RwLock<HashMap<u32, String>>>;

fn register_session(table: &SessionTable, id: u32, station: String) {
    // Write lock — exclusive.
    table.write().unwrap().insert(id, station);
}

fn query_session(table: &SessionTable, id: u32) -> Option<String> {
    // Read lock — concurrent with other readers.
    table.read().unwrap().get(&id).cloned()
}

fn main() {
    let table: SessionTable = Arc::new(RwLock::new(HashMap::new()));
    register_session(&table, 25544, "gs-svalbard".into());

    let readers: Vec<_> = (0..4)
        .map(|_| {
            let t = Arc::clone(&table);
            thread::spawn(move || {
                // All four reader threads can hold the read lock simultaneously.
                println!("{:?}", query_session(&t, 25544));
            })
        })
        .collect();

    for r in readers {
        r.join().unwrap();
    }
}
```
RwLock is not always faster than Mutex. If writes are frequent, readers pay the overhead of checking for pending writers. On some platforms, RwLock can starve writers if readers continuously hold the lock. Profile before committing to RwLock as an optimisation. For the common case of a hot write path with rare reads, Mutex is simpler and often faster.
Deadlock Patterns and How to Prevent Them
The classic deadlock requires two resources and two threads acquiring them in opposite order — though, as the second pattern below shows, a single thread and a single non-reentrant lock are enough. Rust's type system does not prevent any of this. Three patterns cause the vast majority of deadlocks in production:
Lock ordering violation: Thread A acquires lock 1 then lock 2. Thread B acquires lock 2 then lock 1. Each holds what the other needs. Prevention: establish a global lock acquisition order and document it. If the command queue lock must always be acquired before the session table lock, enforce that convention in code review.
Re-entrant locking:
std::sync::Mutex is not reentrant. A thread that calls .lock() on a mutex it already holds will deadlock immediately — there is no second locking that succeeds. This is the source of Meridian's production incidents: a function that acquires the lock, calls a helper, and the helper also acquires the same lock.
Prevention: keep lock-holding code flat. Do not call functions while holding a lock unless you can verify they do not acquire the same lock. If a function is callable both with and without a lock held, split it into two versions or restructure the locking scope.
Holding guards across blocking calls:
In synchronous code: holding a MutexGuard while calling a function that blocks on I/O. In async code: holding a std::sync::MutexGuard across an .await.
Prevention: minimize the scope of guards. Acquire, mutate, release. Do not hold a lock while doing I/O. In async code, use tokio::sync::Mutex or restructure to release the lock before awaiting.
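The lock-ordering discipline can be sketched as follows. ControlPlane and its two fields are hypothetical stand-ins for the command queue and session table described above; the point is the documented, always-identical acquisition order:

```rust
use std::sync::Mutex;

// Hypothetical pair of locks with a documented global order:
// command_queue (order 1) is ALWAYS acquired before session_table (order 2).
struct ControlPlane {
    command_queue: Mutex<Vec<String>>, // lock order: 1
    session_table: Mutex<Vec<u32>>,    // lock order: 2
}

impl ControlPlane {
    // Every path that needs both locks takes them in the same order, so two
    // threads can never end up holding them in opposite order.
    fn dispatch(&self) -> (usize, usize) {
        let queue = self.command_queue.lock().unwrap(); // order 1 first
        let sessions = self.session_table.lock().unwrap(); // then order 2
        (queue.len(), sessions.len())
    }
}

fn main() {
    let cp = ControlPlane {
        command_queue: Mutex::new(vec!["CMD-0001".into()]),
        session_table: Mutex::new(vec![25544]),
    };
    assert_eq!(cp.dispatch(), (1, 1));
}
```

The compiler cannot enforce the order; code review and a comment at each lock declaration do.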
Code Examples
The Meridian Priority Command Queue
The command queue receives operator commands from the ground network interface. Commands have integer priorities. The session dispatcher reads the highest-priority pending command. Multiple ground network connections can write concurrently.
```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, Mutex, Condvar};
use std::thread;
use std::time::Duration;

#[derive(Eq, PartialEq)]
struct Command {
    priority: u8,
    payload: String,
}

impl Ord for Command {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        // Higher priority = higher value in max-heap. Tie-break on payload
        // so Ord stays consistent with the derived Eq.
        self.priority
            .cmp(&other.priority)
            .then_with(|| self.payload.cmp(&other.payload))
    }
}

impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

struct CommandQueue {
    // Mutex + Condvar is the standard pattern for blocking producers/consumers.
    inner: Mutex<BinaryHeap<Command>>,
    available: Condvar,
}

impl CommandQueue {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            inner: Mutex::new(BinaryHeap::new()),
            available: Condvar::new(),
        })
    }

    fn push(&self, cmd: Command) {
        self.inner.lock().unwrap().push(cmd);
        // Notify one waiting consumer that data is available.
        self.available.notify_one();
    }

    fn pop_blocking(&self) -> Command {
        let mut queue = self.inner.lock().unwrap();
        // Condvar::wait releases the mutex and blocks until notified,
        // then reacquires the mutex before returning.
        loop {
            if let Some(cmd) = queue.pop() {
                return cmd;
            }
            queue = self.available.wait(queue).unwrap();
        }
    }
}

fn main() {
    let queue = CommandQueue::new();

    // Producer threads simulate ground network connections.
    let producers: Vec<_> = (0..3)
        .map(|i| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(i * 10));
                q.push(Command {
                    priority: (i as u8 % 3) + 1,
                    payload: format!("CMD-{i:04}"),
                });
                println!("producer {i}: pushed priority {}", (i as u8 % 3) + 1);
            })
        })
        .collect();

    // Consumer runs on a separate thread — simulates session dispatcher.
    let q = Arc::clone(&queue);
    let consumer = thread::spawn(move || {
        for _ in 0..3 {
            let cmd = q.pop_blocking();
            println!("dispatcher: executing '{}' (priority {})", cmd.payload, cmd.priority);
        }
    });

    for p in producers {
        p.join().unwrap();
    }
    consumer.join().unwrap();
}
```
The Condvar solves the busy-wait problem: without it, the consumer would spin-lock on queue.is_empty(), wasting CPU. Condvar::wait atomically releases the mutex and parks the thread, then reacquires the mutex before returning. The .unwrap() on lock() is intentional: if a producer panics while holding the lock, corrupting the queue, the consumer should not continue silently.
Key Takeaways
- Mutex<T> makes protected data inaccessible without locking. MutexGuard is the only way to reach the data, and it releases the lock on drop. There is no way to forget to unlock — but there are ways to hold the lock longer than intended.
- Lock poisoning marks a mutex as potentially inconsistent when a thread panics while holding it. Most production code uses .unwrap() on locks, propagating the panic. Recover from a poisoned mutex only when you can correct the inconsistent state.
- RwLock<T> allows concurrent reads and exclusive writes. It is appropriate when reads are dominant. It is not always faster than Mutex on write-heavy paths — profile before optimizing.
- Three deadlock patterns cover most production incidents: lock ordering violations (acquire in inconsistent order across threads), re-entrant locking (acquiring a lock you already hold), and holding guards across blocking calls. Document lock acquisition order and minimize guard scope.
- In async code, std::sync::Mutex::lock blocks the OS thread, which parks the async worker. Use tokio::sync::Mutex when the lock may be contended and the wait must yield to the executor. Never hold any MutexGuard across a slow .await.
- Condvar is the correct primitive for blocking on a data condition (waiting for a non-empty queue, waiting for a flag). It atomically releases the mutex and parks the thread, avoiding busy-waiting.
Lesson 3 — Atomics and Memory Ordering: Acquire/Release/SeqCst in Practice
Module: Foundation — M02: Concurrency Primitives
Position: Lesson 3 of 3
Source: Rust Atomics and Locks — Mara Bos, Chapters 2–3
Context
The Meridian control plane increments a frame counter every time a telemetry frame is received — 4,800 times per second at full uplink load across 48 satellites. The per-session heartbeat timer fires every 100ms. The frame drop rate is sampled by the monitoring dashboard every second. None of these operations need the overhead of a mutex lock. They need a single integer that multiple threads can read and write without data races.
This is the domain of atomics. std::sync::atomic provides integer and boolean types that support safe concurrent mutation without locking. The operations are indivisible — they either complete entirely or have not happened yet — which prevents the torn reads and non-atomic increments that would corrupt counters under concurrent access.
But atomics are not free. The memory ordering argument on every atomic operation — Relaxed, Acquire, Release, AcqRel, SeqCst — controls what guarantees the processor and compiler make about the ordering of operations across threads. Getting this wrong produces bugs that are invisible in development and intermittent in production.
Source: Rust Atomics and Locks, Chapters 2–3 (Bos)
Core Concepts
What Atomic Operations Guarantee
An atomic operation is indivisible: it either completes entirely before any other operation on the same variable, or it has not happened yet (Rust Atomics and Locks, Ch. 2). Two threads simultaneously performing counter += 1 on a plain integer is undefined behavior — the read-modify-write is three separate operations, and the interleaving is unpredictable. Two threads simultaneously calling counter.fetch_add(1, Relaxed) is defined and correct: each fetch_add is a single atomic step.
The available types live in std::sync::atomic: AtomicBool, AtomicI8/AtomicU8 through AtomicI64/AtomicU64, AtomicIsize/AtomicUsize, and AtomicPtr<T>. All support mutation through a shared reference (&AtomicUsize) — interior mutability without RefCell-style runtime borrow checks and without locking.
Every atomic operation takes an Ordering argument. The ordering is not about the value — it is about the visibility of other memory operations to other threads.
Load, Store, and Fetch-and-Modify
The three basic operation families:
Load and store — read or write the atomic value:
```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

static FRAME_COUNT: AtomicU64 = AtomicU64::new(0);

fn record_frame() {
    FRAME_COUNT.fetch_add(1, Relaxed);
}

fn read_frame_count() -> u64 {
    FRAME_COUNT.load(Relaxed)
}
```
Fetch-and-modify — atomically modify the value and return the previous value (Rust Atomics and Locks, Ch. 2):
```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

fn main() {
    let counter = AtomicU64::new(100);
    let old = counter.fetch_add(23, Relaxed);
    assert_eq!(old, 100);                   // returned the value before the add
    assert_eq!(counter.load(Relaxed), 123); // value after the add
}
```
The full set: fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor, fetch_max, fetch_min, and swap. Use these in preference to compare-and-exchange when the operation fits — they are simpler and the compiler can map them to a single hardware instruction.
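The less common members of this family are just as useful when they fit. A small illustrative sketch (the metric name is hypothetical): fetch_max maintains a high-water mark with a single atomic instruction, with no CAS loop needed:

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

// Hypothetical monitoring metric: largest telemetry frame seen so far.
static MAX_FRAME_BYTES: AtomicU64 = AtomicU64::new(0);

fn observe_frame(bytes: u64) -> u64 {
    // Atomically stores max(current, bytes); returns the previous value.
    MAX_FRAME_BYTES.fetch_max(bytes, Relaxed)
}

fn main() {
    observe_frame(512);
    observe_frame(2048);
    observe_frame(1024); // smaller — leaves the high-water mark untouched
    assert_eq!(MAX_FRAME_BYTES.load(Relaxed), 2048);
}
```

Doing the same with load-then-store would race: two threads could each read 512, then overwrite each other's larger value.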
compare_exchange — The General Atomic Primitive
compare_exchange atomically checks whether the current value equals an expected value, and if so, replaces it with a new value. It returns the previous value on success, and the actual current value on failure (Rust Atomics and Locks, Ch. 2):
```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

fn increment_if_below(a: &AtomicU32, limit: u32) -> bool {
    let mut current = a.load(Relaxed);
    loop {
        if current >= limit {
            return false;
        }
        match a.compare_exchange(current, current + 1, Relaxed, Relaxed) {
            Ok(_) => return true,  // successfully incremented
            Err(v) => current = v, // another thread changed it; retry
        }
    }
}

fn main() {
    let seq = AtomicU32::new(0);
    println!("{}", increment_if_below(&seq, 5)); // true
}
```
The loop-and-retry pattern is fundamental: load the current value, compute the desired new value without holding any lock, then swap atomically only if the value has not changed since the load. If it has changed, retry. This is a lock-free algorithm — no thread ever blocks, and system-wide progress is guaranteed: a compare_exchange can only fail because another thread's operation succeeded.
compare_exchange_weak may spuriously fail (return Err even when the value matches) on some architectures. Use it in loops where spurious failure just triggers another iteration. Use the strong version when you need a guarantee that success or failure is definitive.
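The standard library packages this weak-CAS retry loop as fetch_update, which takes a closure computing the new value and internally retries on spurious failure. A sketch of increment_if_below rewritten with it (same behavior as the explicit loop above):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

// Same contract as the explicit CAS loop: bump the counter unless it
// has reached `limit`. fetch_update retries the closure until its new
// value is stored, or gives up when the closure returns None.
fn increment_if_below(a: &AtomicU32, limit: u32) -> bool {
    a.fetch_update(Relaxed, Relaxed, |v| {
        if v < limit { Some(v + 1) } else { None }
    })
    .is_ok() // Ok(previous) on success, Err(current) if the closure declined
}

fn main() {
    let seq = AtomicU32::new(4);
    assert!(increment_if_below(&seq, 5));  // 4 -> 5
    assert!(!increment_if_below(&seq, 5)); // already at the limit
    assert_eq!(seq.load(Relaxed), 5);
}
```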
The ABA problem: if a value changes from A to B and back to A between the load and the CAS, compare_exchange will succeed even though the value was modified. For simple counters and flags this is harmless; for pointer-based data structures it can be a correctness issue.
Memory Ordering — The Model
Processors and compilers reorder operations when it does not change single-threaded program behavior. In concurrent code, these reorderings can change observed behavior across threads. Memory ordering tells the compiler and processor what reorderings are permissible around a given atomic operation (Rust Atomics and Locks, Ch. 3).
Relaxed — no ordering guarantees beyond consistency on the single atomic variable. All threads see modifications of a given atomic in the same total order, but operations on different variables may be reordered arbitrarily. Use for statistics counters and progress indicators where you only care about the eventual value, not the timing relationship with other operations.
Release (stores) / Acquire (loads) — the most important pair. A release-store establishes a happens-before relationship with any subsequent acquire-load that reads the stored value:
```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Acquire, Release, Relaxed}};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    thread::spawn(|| {
        DATA.store(12345, Relaxed); // (1) write data
        READY.store(true, Release); // (2) publish: everything before this is visible...
    });
    while !READY.load(Acquire) {    // (3) ...once this returns true.
        std::hint::spin_loop();
    }
    println!("{}", DATA.load(Relaxed)); // guaranteed to print 12345
}
```
Once the acquire-load at (3) sees true, the happens-before relationship guarantees that (1) is visible. Without the Acquire/Release pair — using Relaxed on both — the processor could see READY as true while DATA still holds 0.
The names come from the mutex pattern: a mutex unlock is a release-store; a mutex lock-acquire is an acquire-load. Everything the thread did before releasing the mutex is visible to the thread that acquires it next.
AcqRel — both Acquire and Release in a single operation. Used for read-modify-write operations (like fetch_add or compare_exchange) that must both see all prior releases and publish all prior stores.
SeqCst — sequentially consistent: the strongest ordering. All SeqCst operations across all threads form a single total order that every thread agrees on. This is stronger than Acquire/Release and is rarely needed. Use it when you have two threads each setting a flag and then reading the other's flag, and you need to guarantee that at least one thread sees the other's write (Rust Atomics and Locks, Ch. 3). In nearly all other cases, Acquire/Release is sufficient.
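The two-flag scenario can be sketched directly — a minimal adaptation of the pattern discussed in Rust Atomics and Locks, Ch. 3. Under SeqCst, all four operations fall into one total order, so both threads cannot simultaneously observe the other's flag as false:

```rust
use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
use std::thread;

static A: AtomicBool = AtomicBool::new(false);
static B: AtomicBool = AtomicBool::new(false);

fn main() {
    // Each thread sets its own flag, then checks the other's.
    let t1 = thread::spawn(|| {
        A.store(true, SeqCst);
        !B.load(SeqCst) // true only if this thread "won"
    });
    let t2 = thread::spawn(|| {
        B.store(true, SeqCst);
        !A.load(SeqCst)
    });
    let (one_won, two_won) = (t1.join().unwrap(), t2.join().unwrap());
    // If both loads saw false, the total order would have to place each
    // store after the load that missed it — a contradiction under SeqCst.
    assert!(!(one_won && two_won));
}
```

With Acquire/Release alone this invariant does not hold: there is no single order relating operations on the two different variables, and both threads can "win".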
When to Reach for Atomics vs Mutex
Atomics are not a general replacement for mutexes. They are appropriate for:
- Single-value counters and flags (frame counts, connection counts, shutdown flags)
- Lock-free reference counting (the internal mechanism of Arc)
- Progress indicators shared between threads
- Single-producer/single-consumer patterns where acquire/release establishes the necessary ordering
Mutexes are appropriate for:
- Protecting multi-field structs where all fields must be updated atomically
- Any operation that requires a multi-step transaction
- Data structures that cannot be represented as a single atomic value
Reaching for SeqCst everywhere is not safe by default — it has higher cost on some architectures (notably ARM) and the extra strength is rarely needed. Start with Acquire/Release. If your correctness argument requires a global total order across multiple atomics, then SeqCst is warranted.
Code Examples
Multi-Thread Frame Counter with Atomic Statistics
The telemetry pipeline tracks three counters: total frames received, total frames dropped (due to backpressure), and bytes processed. These are written by 48 uplink tasks and read by the monitoring dashboard. A mutex would serialize all 48 writes; atomics let them proceed in parallel.
```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering::{Acquire, Relaxed, Release}};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

struct PipelineMetrics {
    frames_received: AtomicU64,
    frames_dropped: AtomicU64,
    bytes_processed: AtomicU64,
    // Shutdown flag: Release on write, Acquire on read.
    shutdown: AtomicBool,
}

impl PipelineMetrics {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            frames_received: AtomicU64::new(0),
            frames_dropped: AtomicU64::new(0),
            bytes_processed: AtomicU64::new(0),
            shutdown: AtomicBool::new(false),
        })
    }

    fn record_frame(&self, bytes: u64) {
        // Relaxed: these counters are for monitoring only.
        // The exact ordering relative to other threads' stores doesn't matter;
        // we only care about the eventual totals.
        self.frames_received.fetch_add(1, Relaxed);
        self.bytes_processed.fetch_add(bytes, Relaxed);
    }

    fn record_drop(&self) {
        self.frames_dropped.fetch_add(1, Relaxed);
    }

    fn signal_shutdown(&self) {
        // Release: ensures all frame counts written before this are visible
        // to any thread that reads shutdown with Acquire.
        self.shutdown.store(true, Release);
    }

    fn should_stop(&self) -> bool {
        // Acquire: establishes happens-before with the Release store above.
        // Any Relaxed loads on frames_received etc. after this call
        // will see all stores from before signal_shutdown().
        self.shutdown.load(Acquire)
    }

    fn snapshot(&self) -> (u64, u64, u64) {
        (
            self.frames_received.load(Relaxed),
            self.frames_dropped.load(Relaxed),
            self.bytes_processed.load(Relaxed),
        )
    }
}

fn main() {
    let metrics = PipelineMetrics::new();

    // Simulate 4 uplink tasks.
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let m = Arc::clone(&metrics);
            thread::spawn(move || {
                for _ in 0..100 {
                    if m.should_stop() {
                        break;
                    }
                    m.record_frame(1024);
                    if i == 0 {
                        m.record_drop(); // simulate occasional drops on uplink 0
                    }
                }
            })
        })
        .collect();

    // Monitoring thread samples every 5 ms.
    let m = Arc::clone(&metrics);
    let monitor = thread::spawn(move || {
        for _ in 0..3 {
            thread::sleep(Duration::from_millis(5));
            let (recv, drop, bytes) = m.snapshot();
            println!("recv={recv} drop={drop} bytes={bytes}");
        }
        m.signal_shutdown();
    });

    for w in workers {
        w.join().unwrap();
    }
    monitor.join().unwrap();

    let (recv, drop, bytes) = metrics.snapshot();
    println!("final: recv={recv} drop={drop} bytes={bytes}");
}
```
The Acquire/Release pair on the shutdown flag ensures that after any thread reads should_stop() as true, all Relaxed frame counts written before signal_shutdown() are visible. Without this pair, the monitoring thread could read shutdown=1 but still see stale frame counts from before the shutdown writes.
Key Takeaways
- Atomic operations are indivisible: a fetch_add on an AtomicU64 is a single step with no observable intermediate state. Plain integer += is not atomic — concurrent modification is undefined behavior.
- fetch_add and friends return the value before the operation. This is intentional: it lets you use the old value to implement compare-and-swap patterns or sequence counters.
- compare_exchange is the general-purpose lock-free primitive. The loop-and-retry pattern — load, compute, CAS, retry on failure — enables lock-free algorithms where no thread ever blocks.
- Relaxed ordering gives only modification order on a single variable. It is correct for statistics counters and progress indicators where cross-variable ordering does not matter.
- Acquire/Release establishes happens-before across threads. A release-store publishes all preceding memory operations; an acquire-load that reads that value sees all of them. This is what makes mutex unlock/lock, Arc drop/clone, and cross-thread data handoffs safe.
- SeqCst provides a global total order across all SeqCst operations on all threads. Use it only when you need to coordinate two or more flags where the relative order matters globally. In practice, Acquire/Release covers the vast majority of use cases.
Project — Ground Station Command Queue
Module: Foundation — M02: Concurrency Primitives
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Operations Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0044 — Priority Command Queue Implementation
The legacy Python command queue used a global dictionary with a threading lock. Over the past six months it has been involved in four production incidents: two deadlocks from re-entrant locking, one priority inversion where a low-priority housekeeping command blocked an emergency SAFE_MODE injection, and one data race when a monitoring process read the queue mid-write.
The replacement must be a typed, concurrent priority queue in Rust. It accepts mission-critical commands from multiple concurrent ground network connections, dispatches them in priority order to the session controller, and exposes lock-free metrics to the monitoring system — without the failure modes of the Python implementation.
System Specification
Command Model
Commands have a u8 priority (0 = lowest, 255 = highest). Predefined priorities:
| Command type | Priority |
|---|---|
| SAFE_MODE | 255 |
| ABORT_PASS | 200 |
| REPOINT | 100 |
| STATUS_REQUEST | 50 |
| HOUSEKEEPING | 10 |
A Command struct:
```rust
#[derive(Debug, Eq, PartialEq)]
pub struct Command {
    pub priority: u8,
    pub kind: CommandKind,
    pub issued_at: std::time::Instant,
}

#[derive(Debug, Eq, PartialEq)]
pub enum CommandKind {
    SafeMode,
    AbortPass,
    // Whole-degree pointing values: f32 fields would prevent deriving Eq,
    // which the BinaryHeap ordering (Ord) requires.
    Repoint { azimuth: u32, elevation: u32 },
    StatusRequest,
    Housekeeping,
}
```
Queue Behaviour
- Multiple producer threads (one per ground station connection) push commands concurrently.
- One consumer thread (the session dispatcher) pops the highest-priority command. If multiple commands share the same priority, the oldest (by issued_at) is dispatched first.
- When the queue is empty, the consumer blocks without busy-waiting.
- The queue has a configurable capacity. If full, a push from a producer blocks until space is available. Blocking producers must not block the consumer.
Metrics
The following counters are maintained lock-free and available to the monitoring system without acquiring any lock:
- commands_pushed — total commands ever pushed (all priorities)
- commands_dispatched — total commands ever dispatched
- safe_mode_count — number of SAFE_MODE commands dispatched (priority 255)
Shutdown
The queue accepts a shutdown signal. On shutdown:
- No new pushes are accepted — producers get an Err(QueueShutdown).
- The consumer drains any remaining commands in priority order.
- Once the queue is empty and shutdown is signalled, the consumer returns.
Expected Output
A library crate (meridian-cmdqueue) with:
- A CommandQueue type with push, pop, and shutdown methods
- An Arc<Metrics> accessible from the CommandQueue with the three lock-free counters
- A binary that demonstrates: 3 producer threads pushing 5 commands each, 1 consumer thread dispatching them in priority order, a monitoring thread sampling metrics every 20ms, and shutdown after all producers finish
The output should clearly show commands being dispatched in priority order (not FIFO).
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Commands dispatched in priority order (highest first, then oldest-first within priority) | Yes — log output order |
| 2 | Consumer blocks without busy-waiting when queue is empty | Yes — no >5% CPU when idle (measure with top) |
| 3 | Multiple concurrent producers do not cause data races | Yes — runs under cargo test with --test-threads=1 via loom or ThreadSanitizer |
| 4 | Metrics counters are readable without acquiring any queue lock | Yes — code review: metrics accessed via atomics only |
| 5 | Shutdown drains remaining commands before consumer exits | Yes — log shows all pushed commands dispatched before exit |
| 6 | Producer push blocks (does not drop commands) when queue is at capacity | Yes — test with capacity=2 and 10 concurrent pushes |
| 7 | No .unwrap() on Mutex::lock() without a comment on the invariant | Yes — code review |
Hints
Hint 1 — Implementing priority + FIFO ordering on BinaryHeap
BinaryHeap is a max-heap. To get "highest priority first, then oldest first within the same priority," implement Ord on Command so that a higher priority compares greater and, within equal priorities, an older issued_at compares greater — the max-heap then pops highest priority first, oldest first on ties:
```rust
use std::cmp::Ordering;
use std::time::Instant;

struct Command { priority: u8, issued_at: Instant }

impl Ord for Command {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority.cmp(&other.priority)
            .then_with(|| other.issued_at.cmp(&self.issued_at)) // older = higher
    }
}

impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Command {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority && self.issued_at == other.issued_at
    }
}

impl Eq for Command {}
```
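A quick self-contained check of this ordering — push three commands and pop them back, verifying both the priority order and the FIFO tie-break (a sketch; the struct here carries only the two fields Ord inspects):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

struct Command { priority: u8, issued_at: Instant }

impl Ord for Command {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority.cmp(&other.priority)
            .then_with(|| other.issued_at.cmp(&self.issued_at)) // older = higher
    }
}
impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl PartialEq for Command {
    fn eq(&self, other: &Self) -> bool { self.cmp(other) == Ordering::Equal }
}
impl Eq for Command {}

fn main() {
    let t0 = Instant::now();
    let later = t0 + Duration::from_millis(1);

    let mut heap = BinaryHeap::new();
    heap.push(Command { priority: 50, issued_at: later }); // newer
    heap.push(Command { priority: 255, issued_at: t0 });
    heap.push(Command { priority: 50, issued_at: t0 });    // older

    let first = heap.pop().unwrap();
    assert_eq!(first.priority, 255); // highest priority wins
    let second = heap.pop().unwrap();
    // Of the two priority-50 commands, the older one comes out first.
    assert_eq!((second.priority, second.issued_at), (50, t0));
}
```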
Hint 2 — Blocking push with capacity using Mutex + two Condvars
Two Condvars: one signals "not full" (wake a blocked producer), one signals "not empty" (wake the consumer):
```rust
use std::sync::{Mutex, Condvar};
use std::collections::BinaryHeap;

struct QueueInner<T> {
    heap: BinaryHeap<T>,
    capacity: usize,
    shutdown: bool,
}

struct CommandQueue<T> {
    inner: Mutex<QueueInner<T>>,
    not_empty: Condvar,
    not_full: Condvar,
}
```
Push blocks on not_full when the heap is at capacity; pop blocks on not_empty when the heap is empty. Each operation notifies the other condvar after completing.
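That push/pop handshake can be fleshed out into a minimal, runnable sketch. BoundedQueue here is a hypothetical simplification of the CommandQueue — shutdown and metrics are omitted so only the two-condvar blocking logic remains:

```rust
use std::collections::BinaryHeap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct QueueInner<T> {
    heap: BinaryHeap<T>,
    capacity: usize,
}

struct BoundedQueue<T: Ord> {
    inner: Mutex<QueueInner<T>>,
    not_empty: Condvar,
    not_full: Condvar,
}

impl<T: Ord> BoundedQueue<T> {
    fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(QueueInner { heap: BinaryHeap::new(), capacity }),
            not_empty: Condvar::new(),
            not_full: Condvar::new(),
        }
    }

    // Blocks (without busy-waiting) while the heap is at capacity.
    fn push(&self, item: T) {
        let mut inner = self.inner.lock().unwrap();
        while inner.heap.len() >= inner.capacity {
            inner = self.not_full.wait(inner).unwrap();
        }
        inner.heap.push(item);
        self.not_empty.notify_one(); // wake the consumer, if it is waiting
    }

    // Blocks while the heap is empty; returns the max element.
    fn pop(&self) -> T {
        let mut inner = self.inner.lock().unwrap();
        loop {
            if let Some(item) = inner.heap.pop() {
                self.not_full.notify_one(); // a slot just opened for producers
                return item;
            }
            inner = self.not_empty.wait(inner).unwrap();
        }
    }
}

fn main() {
    let q = Arc::new(BoundedQueue::new(2));
    let producer = {
        let q = Arc::clone(&q);
        thread::spawn(move || {
            for v in [10u8, 255, 50] { q.push(v); } // third push waits for a pop
        })
    };
    let mut seen = vec![q.pop(), q.pop(), q.pop()];
    producer.join().unwrap();
    seen.sort();
    assert_eq!(seen, vec![10, 50, 255]); // all three delivered, none dropped
}
```

Note the `while` in push: condvar waits can wake spuriously, so the capacity condition must be rechecked after every wakeup.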
Hint 3 — Lock-free metrics with atomics
Counters increment in push and pop, which both hold the mutex. But the monitoring thread must read without the mutex. Use atomics for all three counters — write from inside the locked section (ordering is Relaxed since the mutex provides the actual happens-before relationship), read from the monitoring thread with Relaxed:
```rust
use std::sync::atomic::AtomicU64;
use std::sync::Arc;

pub struct Metrics {
    pub commands_pushed: AtomicU64,
    pub commands_dispatched: AtomicU64,
    pub safe_mode_count: AtomicU64,
}

impl Metrics {
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            commands_pushed: AtomicU64::new(0),
            commands_dispatched: AtomicU64::new(0),
            safe_mode_count: AtomicU64::new(0),
        })
    }
}
```
Hint 4 — Shutdown drain sequence
Set shutdown = true in the inner state while holding the mutex, then notify_all() on both condvars. In push, check shutdown after acquiring the lock and return Err if set. In pop, check shutdown && heap.is_empty() — if both are true, return None to signal the consumer to exit:
```rust
// In pop:
let mut inner = self.inner.lock().unwrap();
loop {
    if let Some(cmd) = inner.heap.pop() {
        self.not_full.notify_one();
        return Some(cmd);
    }
    if inner.shutdown {
        return None; // Queue is empty and shutdown — consumer exits
    }
    inner = self.not_empty.wait(inner).unwrap();
}
```
Reference Implementation
```rust
// src/lib.rs
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::{Arc, Condvar, Mutex};
use std::time::Instant;

#[derive(Debug)]
pub struct QueueShutdown;

impl std::fmt::Display for QueueShutdown {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "command queue is shut down")
    }
}

#[derive(Debug, Eq, PartialEq)]
pub enum CommandKind {
    SafeMode,
    AbortPass,
    Repoint { azimuth: u32, elevation: u32 },
    StatusRequest,
    Housekeeping,
}

#[derive(Debug)]
pub struct Command {
    pub priority: u8,
    pub kind: CommandKind,
    pub issued_at: Instant,
}

impl Ord for Command {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            // Within the same priority, older commands go first.
            .then_with(|| other.issued_at.cmp(&self.issued_at))
    }
}

impl PartialOrd for Command {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// PartialEq must agree with Ord, so compare exactly the fields Ord uses.
impl PartialEq for Command {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority && self.issued_at == other.issued_at
    }
}

impl Eq for Command {}

pub struct Metrics {
    pub commands_pushed: AtomicU64,
    pub commands_dispatched: AtomicU64,
    pub safe_mode_count: AtomicU64,
}

impl Metrics {
    fn new() -> Arc<Self> {
        Arc::new(Self {
            commands_pushed: AtomicU64::new(0),
            commands_dispatched: AtomicU64::new(0),
            safe_mode_count: AtomicU64::new(0),
        })
    }
}

struct Inner {
    heap: BinaryHeap<Command>,
    capacity: usize,
    shutdown: bool,
}

pub struct CommandQueue {
    inner: Mutex<Inner>,
    not_empty: Condvar,
    not_full: Condvar,
    pub metrics: Arc<Metrics>,
}

impl CommandQueue {
    pub fn new(capacity: usize) -> Arc<Self> {
        Arc::new(Self {
            inner: Mutex::new(Inner {
                heap: BinaryHeap::with_capacity(capacity),
                capacity,
                shutdown: false,
            }),
            not_empty: Condvar::new(),
            not_full: Condvar::new(),
            metrics: Metrics::new(),
        })
    }

    pub fn push(&self, cmd: Command) -> Result<(), QueueShutdown> {
        let mut inner = self.inner.lock().unwrap();
        loop {
            if inner.shutdown {
                return Err(QueueShutdown);
            }
            if inner.heap.len() < inner.capacity {
                let is_safe_mode = matches!(cmd.kind, CommandKind::SafeMode);
                inner.heap.push(cmd);
                // Relaxed: the mutex provides the happens-before. These are
                // statistics only — no cross-variable ordering needed.
                self.metrics.commands_pushed.fetch_add(1, Relaxed);
                if is_safe_mode {
                    self.metrics.safe_mode_count.fetch_add(1, Relaxed);
                }
                self.not_empty.notify_one();
                return Ok(());
            }
            // Queue full — block until space opens or shutdown.
            inner = self.not_full.wait(inner).unwrap();
        }
    }

    pub fn pop(&self) -> Option<Command> {
        let mut inner = self.inner.lock().unwrap();
        loop {
            if let Some(cmd) = inner.heap.pop() {
                self.metrics.commands_dispatched.fetch_add(1, Relaxed);
                self.not_full.notify_one();
                return Some(cmd);
            }
            if inner.shutdown {
                return None;
            }
            inner = self.not_empty.wait(inner).unwrap();
        }
    }

    pub fn shutdown(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.shutdown = true;
        // Wake all blocked producers and the consumer.
        self.not_empty.notify_all();
        self.not_full.notify_all();
    }
}
```
```rust
// src/main.rs (demonstration binary)
use meridian_cmdqueue::{Command, CommandKind, CommandQueue};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let queue = CommandQueue::new(20);
    let metrics = Arc::clone(&queue.metrics);

    // Three producer threads — simulate ground network connections.
    let producers: Vec<_> = (0..3u8)
        .map(|gs| {
            let q = Arc::clone(&queue);
            thread::spawn(move || {
                let priorities = [255u8, 200, 100, 50, 10];
                for &priority in &priorities {
                    thread::sleep(Duration::from_millis(gs as u64 * 5));
                    let kind = match priority {
                        255 => CommandKind::SafeMode,
                        200 => CommandKind::AbortPass,
                        100 => CommandKind::Repoint { azimuth: 180, elevation: 45 },
                        50 => CommandKind::StatusRequest,
                        _ => CommandKind::Housekeeping,
                    };
                    match q.push(Command { priority, kind, issued_at: Instant::now() }) {
                        Ok(()) => println!("gs-{gs}: pushed priority {priority}"),
                        Err(e) => println!("gs-{gs}: push rejected — {e}"),
                    }
                }
            })
        })
        .collect();

    // Consumer thread — session dispatcher.
    let q = Arc::clone(&queue);
    let consumer = thread::spawn(move || {
        while let Some(cmd) = q.pop() {
            println!("dispatcher: {:?} (priority {})", cmd.kind, cmd.priority);
            thread::sleep(Duration::from_millis(10));
        }
        println!("dispatcher: queue drained, exiting");
    });

    // Monitoring thread.
    let monitor = thread::spawn(move || {
        for _ in 0..4 {
            thread::sleep(Duration::from_millis(20));
            println!(
                "metrics: pushed={} dispatched={} safe_mode={}",
                metrics.commands_pushed.load(std::sync::atomic::Ordering::Relaxed),
                metrics.commands_dispatched.load(std::sync::atomic::Ordering::Relaxed),
                metrics.safe_mode_count.load(std::sync::atomic::Ordering::Relaxed),
            );
        }
    });

    for p in producers {
        p.join().unwrap();
    }
    queue.shutdown();
    consumer.join().unwrap();
    monitor.join().unwrap();
}
```
Reflection
The command queue built here uses all three concurrency layers from this module: OS threads for the producer and consumer, Mutex + Condvar for blocking coordination, and atomics for the metrics that must be readable without acquiring any lock. The relationship between these layers — the mutex providing the happens-before for the atomic writes, the condvar providing the non-busy-waiting block, the atomics avoiding any lock on the read path — is the pattern used throughout the Meridian control plane.
The natural next question: the blocking push is correct but puts an upper bound on producer throughput. In Module 3, this queue is extended with a tokio::sync::mpsc front-end that moves the backpressure into async channel semantics rather than blocking OS threads.
Module 03 — Message Passing Patterns
Track: Foundation — Mission Control Platform
Position: Module 3 of 6
Source material: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 6, 7, 8
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Multi-Source Telemetry Aggregator
- Prerequisites
- What Comes Next
Mission Context
Module 2 built shared-state concurrency: Mutex, RwLock, atomics. Those primitives protect data that multiple actors need to touch. This module takes the complementary approach: instead of sharing data, pass ownership through channels. Producers and consumers are decoupled — each owns its state exclusively, communicating only through typed messages.
For the Meridian control plane, message passing is the primary architecture for the telemetry pipeline. 48 satellite uplinks funnel frames into a priority-ordered aggregator. TLE catalog updates fan out to every active session simultaneously. The shutdown signal propagates to all tasks through a watched value. None of these require shared mutable state — they compose entirely from channel primitives.
What You Will Learn
By the end of this module you will be able to:
- Create bounded mpsc channels, size them for backpressure, clone senders for multiple concurrent producers, and design consumer loops that terminate cleanly when all senders drop
- Implement the actor pattern: an async task that owns its state exclusively and exposes all operations as messages, using oneshot channels for request-response within the message protocol
- Distribute events to all subscribers using broadcast, handle RecvError::Lagged correctly, and size the broadcast capacity for the slowest realistic consumer
- Distribute current state to many readers using watch, understand the difference between event distribution and state distribution, and apply Arc<T> inside a watch for cheap config reads
- Merge multiple independent async sources into one stream using shared-sender MPSC (uniform sources), select! { biased; } (priority sources), and a router actor (dynamic sources)
- Choose between mpsc, broadcast, watch, and oneshot given a fan-in or fan-out requirement
Lessons
Lesson 1 — tokio::mpsc: Bounded Channels, Backpressure, and Sender Cloning
Covers mpsc::channel(capacity), Sender::clone for multiple producers, send().await as the backpressure mechanism, try_send for non-blocking producers, the consumer loop termination on sender drop, oneshot for request-response, and the actor pattern as the structural idiom that emerges from MPSC channels.
Key question this lesson answers: How do you safely move work between concurrent async tasks without shared state, and what ensures slow consumers are not overwhelmed by fast producers?
→ lesson-01-mpsc.md / lesson-01-quiz.toml
Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns
Covers broadcast::channel(capacity) for event fan-out (every subscriber gets every message), RecvError::Lagged handling, watch::channel(initial) for state fan-out (latest value, change notification), borrow() for lock-free reads, and the decision matrix for choosing between mpsc, broadcast, and watch.
Key question this lesson answers: How do you distribute one event or one value to many concurrent tasks, and when does missing an intermediate update matter?
→ lesson-02-broadcast-watch.md / lesson-02-quiz.toml
Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds
Covers shared-sender MPSC for uniform fan-in, select! { biased; } for priority fan-in with two priority levels, message tagging with typed source identifiers, and the router actor for dynamic fan-in (sources registered and removed at runtime).
Key question this lesson answers: How do you merge many independent async sources into one stream with control over priority, fairness, and dynamic source registration?
→ lesson-03-fan-in.md / lesson-03-quiz.toml
Capstone Project — Multi-Source Telemetry Aggregator
Build the full telemetry aggregation pipeline: a router actor with dynamic source registration, a priority fan-in that ensures emergency frames are never delayed behind routine telemetry, a bounded frame processor with backpressure, a broadcast fan-out to downstream consumers, atomic pipeline statistics exposed through a watch channel, and a clean shutdown sequence.
Acceptance is against 7 verifiable criteria including emergency frame priority, dynamic source registration, backpressure enforcement, lossless shutdown drain, and lagged broadcast handling.
→ project-telemetry-aggregator.md
Prerequisites
Modules 1 and 2 must be complete. Module 1 established how async tasks are scheduled and why they cooperatively yield — essential for understanding why bounded channel backpressure works without blocking threads. Module 2 established the shared-state model that message passing replaces — understanding both models is necessary to choose the right one for a given problem.
What Comes Next
Module 4 — Network Programming connects the message-passing pipeline to the network. The telemetry aggregator from this module gains a TCP listener front-end, turning the router actor into a full ground station connection broker that accepts connections from the 12 Meridian ground station sites.
Lesson 1 — tokio::mpsc: Bounded Channels, Backpressure, and Sender Cloning
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 1 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapter 8
Context
Module 2's command queue used a Mutex<BinaryHeap> plus a Condvar to share state between threads. That approach works, but it couples the producers and consumer through a shared data structure — every access requires acquiring the same lock, and the consumer must hold the lock while inspecting queue contents. Under contention at 48-uplink load, that lock becomes a bottleneck.
The alternative model is message passing: producers send values into a channel; the consumer receives from it. There is no shared data structure, no explicit locking, and no Arc to pass around. The channel itself manages all synchronization. The backpressure mechanism is built in: when the channel is full, send yields rather than blocking a thread, and the async runtime can schedule other work while the producer waits.
This lesson covers tokio::sync::mpsc — the multi-producer, single-consumer channel that is the workhorse of most async Rust systems. It also covers oneshot for request-response patterns and introduces the actor model as the structural pattern that emerges naturally from combining channels with task ownership.
Source: Async Rust, Chapter 8 (Flitton & Morton)
Core Concepts
MPSC Channels: The Model
tokio::sync::mpsc::channel(capacity) creates a bounded channel and returns a (Sender<T>, Receiver<T>) pair. The capacity is the maximum number of messages that can sit in the channel before senders must wait:
```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity of 32: up to 32 messages can be buffered.
    // If the receiver falls behind, the 33rd send will yield.
    let (tx, mut rx) = mpsc::channel::<String>(32);

    tokio::spawn(async move {
        tx.send("frame-001".to_string()).await.unwrap();
    });

    while let Some(msg) = rx.recv().await {
        println!("received: {msg}");
    }
}
```
recv() returns None when all Sender handles have been dropped — this is the clean shutdown signal for a consumer loop. No explicit close call is needed; the channel closes naturally when the last sender drops.
Sender is Clone: Multiple Producers
Sender<T> implements Clone. Each clone is an independent handle to the same channel. This is the "multi-producer" part of MPSC — any number of tasks can hold a Sender and push messages concurrently. The receiver sees messages from all senders interleaved, in the order they are delivered to the channel.
```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(64);

    // Each uplink session gets its own cloned sender.
    for satellite_id in 0..4u32 {
        let tx = tx.clone();
        tokio::spawn(async move {
            for seq in 0u8..3 {
                let frame = vec![satellite_id as u8, seq];
                // Yields if channel is full — backpressure in action.
                tx.send((satellite_id, frame)).await
                    .expect("aggregator task dropped");
            }
        });
    }

    // Drop the original sender so the channel closes when all
    // spawned tasks finish. Without this drop, rx.recv() never
    // returns None — the original sender keeps the channel alive.
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}
```
The drop of the original tx after spawning is important and easy to forget. If any Sender clone outlives its usefulness, the channel stays open and the consumer loop blocks forever. The idiomatic pattern is to clone before spawning and drop the original.
Backpressure and Capacity Sizing
A bounded channel applies backpressure: when the channel reaches capacity, send().await yields and does not return until the consumer has drained a slot. This is the async equivalent of a blocking queue — it prevents fast producers from overwhelming a slow consumer.
try_send is the non-blocking variant. It returns Err(TrySendError::Full(_)) immediately if the channel is full rather than yielding. Use it when the producer should take an alternative action (log, drop, route to overflow) rather than applying backpressure:
```rust
use tokio::sync::mpsc;

async fn forward_or_drop(tx: &mpsc::Sender<Vec<u8>>, frame: Vec<u8>) {
    match tx.try_send(frame) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(frame)) => {
            // Aggregator is falling behind — record the drop and continue.
            // In production: increment a metrics counter here.
            tracing::warn!(bytes = frame.len(), "frame dropped: aggregator full");
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            tracing::error!("aggregator task has exited");
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(8);

    // Demonstrate try_send behaviour
    for i in 0u8..12 {
        forward_or_drop(&tx, vec![i]).await;
    }
    drop(tx);

    let mut count = 0;
    while rx.recv().await.is_some() {
        count += 1;
    }
    println!("received {count} frames (8 max due to capacity)");
}
```
Capacity sizing: too small causes unnecessary producer backpressure; too large hides a slow consumer until the buffer is exhausted. For the Meridian aggregator, a capacity of 2–4× the expected burst size is a reasonable starting point. Profile under realistic load.
unbounded_channel() provides no capacity limit — senders never yield. Use it only when backpressure is handled at an outer layer and unbounded buffering is acceptable (e.g., a metrics sink that can absorb any burst). Unbounded channels can cause OOM if the consumer is slower than the producers.
oneshot: Request-Response
tokio::sync::oneshot is a single-message channel: exactly one send, exactly one receive. It is the correct primitive for request-response patterns, where a task sends a request and needs to await the result:
```rust
use tokio::sync::{mpsc, oneshot};

enum ControlMsg {
    GetQueueDepth { reply: oneshot::Sender<usize> },
    Flush,
}

async fn aggregator(mut rx: mpsc::Receiver<ControlMsg>) {
    let mut queue: Vec<Vec<u8>> = Vec::new();
    while let Some(msg) = rx.recv().await {
        match msg {
            ControlMsg::GetQueueDepth { reply } => {
                // reply.send consumes the sender — can only respond once.
                let _ = reply.send(queue.len());
            }
            ControlMsg::Flush => {
                println!("flushing {} frames", queue.len());
                queue.clear();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<ControlMsg>(8);
    tokio::spawn(aggregator(rx));

    // Ask the aggregator for its current queue depth.
    let (reply_tx, reply_rx) = oneshot::channel::<usize>();
    tx.send(ControlMsg::GetQueueDepth { reply: reply_tx }).await.unwrap();
    let depth = reply_rx.await.unwrap();
    println!("aggregator queue depth: {depth}");
}
```
The oneshot::Sender is embedded in the message itself. When the aggregator handles the message, it sends back through the oneshot and the caller's reply_rx.await resolves. This pattern — sometimes called the "mailbox" or "actor" pattern — eliminates the need for any shared state between the caller and the aggregator.
The Actor Pattern
An actor is an async task that owns its state exclusively and exposes its functionality entirely through message passing (Async Rust, Ch. 8). No locks, no shared Arc, no exposed fields. Every operation on the actor's state happens sequentially within the actor's message loop — concurrent safety is structural, not from locking.
The advantages: the actor's state is never accessed concurrently. There are no data races by construction. Testing is straightforward — send messages, check responses. Adding operations means adding enum variants, not adding lock guards.
The tradeoffs: all operations are async (each call involves a channel send and an await). If many callers need responses simultaneously, the actor is a serialization point. If the actor's work is CPU-intensive, it blocks its own message loop. Both are solvable — the first with multiple actors, the second with spawn_blocking inside the loop — but they require deliberate design.
Code Examples
Telemetry Frame Aggregator Actor
The aggregator is an actor: it owns the frame buffer exclusively, receives frames and control messages through a single channel, and responds to queries via embedded oneshot channels. No locks anywhere.
```rust
use tokio::sync::{mpsc, oneshot};
use std::collections::VecDeque;

const MAX_BUFFER: usize = 1000;

#[derive(Debug)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    payload: Vec<u8>,
}

enum AggregatorMsg {
    /// A new frame from an uplink session.
    Frame(TelemetryFrame),
    /// Request: how many frames are buffered?
    Depth { reply: oneshot::Sender<usize> },
    /// Drain the buffer and return all frames.
    Drain { reply: oneshot::Sender<Vec<TelemetryFrame>> },
}

async fn run_aggregator(mut rx: mpsc::Receiver<AggregatorMsg>) {
    let mut buffer: VecDeque<TelemetryFrame> = VecDeque::with_capacity(MAX_BUFFER);
    while let Some(msg) = rx.recv().await {
        match msg {
            AggregatorMsg::Frame(frame) => {
                if buffer.len() >= MAX_BUFFER {
                    tracing::warn!(
                        satellite_id = frame.satellite_id,
                        "buffer full — dropping oldest frame"
                    );
                    buffer.pop_front();
                }
                buffer.push_back(frame);
            }
            AggregatorMsg::Depth { reply } => {
                let _ = reply.send(buffer.len());
            }
            AggregatorMsg::Drain { reply } => {
                let frames: Vec<_> = buffer.drain(..).collect();
                let _ = reply.send(frames);
            }
        }
    }
    tracing::info!("aggregator: all senders dropped, shutting down");
}

/// A typed handle to the aggregator actor.
/// Hides the channel internals from callers.
#[derive(Clone)]
struct AggregatorHandle {
    tx: mpsc::Sender<AggregatorMsg>,
}

impl AggregatorHandle {
    fn spawn(capacity: usize) -> Self {
        let (tx, rx) = mpsc::channel(capacity);
        tokio::spawn(run_aggregator(rx));
        Self { tx }
    }

    async fn send_frame(&self, frame: TelemetryFrame) -> anyhow::Result<()> {
        self.tx.send(AggregatorMsg::Frame(frame)).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))
    }

    async fn depth(&self) -> anyhow::Result<usize> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Depth { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }

    async fn drain(&self) -> anyhow::Result<Vec<TelemetryFrame>> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(AggregatorMsg::Drain { reply: reply_tx }).await
            .map_err(|_| anyhow::anyhow!("aggregator has shut down"))?;
        reply_rx.await.map_err(|_| anyhow::anyhow!("aggregator dropped reply"))
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let agg = AggregatorHandle::spawn(128);

    // Simulate 4 concurrent uplink sessions each sending 3 frames.
    let tasks: Vec<_> = (0..4u32).map(|sat_id| {
        let agg = agg.clone();
        tokio::spawn(async move {
            for seq in 0u64..3 {
                agg.send_frame(TelemetryFrame {
                    satellite_id: sat_id,
                    sequence: seq,
                    payload: vec![sat_id as u8; 64],
                }).await.unwrap();
            }
        })
    }).collect();

    for t in tasks {
        t.await.unwrap();
    }

    println!("buffered: {}", agg.depth().await?);
    let frames = agg.drain().await?;
    println!("drained {} frames", frames.len());
    Ok(())
}
```
The AggregatorHandle is the public API. Callers see send_frame, depth, and drain — they never interact with the channel directly. The handle is Clone, so it can be shared freely across tasks by cloning, with no Arc<Mutex<...>> needed.
Key Takeaways
- `tokio::sync::mpsc::channel(capacity)` creates a bounded channel. The capacity is the backpressure valve: `send().await` yields when the channel is full, preventing fast producers from overwhelming slow consumers.
- `Sender<T>` is `Clone`. Every clone is an independent producer on the same channel. The channel closes when all senders drop. Always drop the original sender after spawning cloned senders, or the consumer loop will block forever.
- `try_send` is the non-blocking variant. Use it when the producer should take an alternative action — drop, log, route to overflow — rather than yielding. Prefer `send().await` when backpressure is the correct response.
- `oneshot` is the single-message channel for request-response patterns. Embed the `oneshot::Sender` in the message to allow the receiver to reply exactly once. The `Sender` is consumed on send — using it more than once is a compile error.
- The actor pattern — an async task that owns its state exclusively and receives all operations as messages — eliminates shared state and all associated locking. It is the structural pattern that emerges naturally from MPSC channels in async systems.
Lesson 2 — Broadcast and Watch Channels: Fan-Out Patterns for Telemetry Distribution
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 2 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 6, 7
Context
MPSC channels move work from many producers to one consumer. The inverse problem is fan-out: distributing one event to many consumers. The Meridian control plane has two distinct fan-out requirements that call for different solutions.
The first: when a TLE catalog update arrives, every active uplink session needs to process it. Each session must see the update — no session should receive it twice, and no session should miss it. This is an event-distribution problem.
The second: the shutdown flag. Every task in the control plane needs to know when the system is shutting down, but they do not need to receive a separate "shutdown event" — they just need to be able to check the current value at any time. This is a state-distribution problem.
Tokio provides a dedicated primitive for each. broadcast solves event distribution: every subscriber receives every message. watch solves state distribution: subscribers observe the latest value and are notified when it changes.
Source: Async Rust, Chapters 6–7 (Flitton & Morton)
Core Concepts
tokio::sync::broadcast — Every Subscriber Gets Every Message
broadcast::channel(capacity) returns a (Sender<T>, Receiver<T>) pair. Additional receivers are created by calling sender.subscribe() — each receiver gets its own position in the channel and receives every message sent after the subscription point.
```rust
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (tx, _rx) = broadcast::channel::<String>(16);

    // Each session gets its own receiver.
    let mut session_a = tx.subscribe();
    let mut session_b = tx.subscribe();

    tx.send("TLE-UPDATE-2024-001".to_string()).unwrap();

    // Both sessions receive the same message independently.
    println!("A: {}", session_a.recv().await.unwrap());
    println!("B: {}", session_b.recv().await.unwrap());
}
```
Sender::send does not require await — it is synchronous. Messages are placed in a ring buffer; receivers read from their own position in that buffer.
The Lagged Error and What to Do With It
The broadcast channel has a fixed capacity ring buffer. If a slow receiver falls behind by more than capacity messages, it loses its position in the buffer. The next recv() call returns Err(RecvError::Lagged(n)), where n is the number of messages missed.
This is not a fatal error. The receiver continues to work — it simply missed n messages and will receive all subsequent ones. Whether missing messages is acceptable depends on the use case. For TLE catalog updates, a session that missed 3 updates can request a fresh fetch. For an audit log, missing messages is a compliance issue.
```rust
use tokio::sync::broadcast;

async fn session_loop(mut rx: broadcast::Receiver<Vec<u8>>) {
    loop {
        match rx.recv().await {
            Ok(frame) => {
                // Normal path.
                process_update(frame).await;
            }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                // Receiver fell behind — n messages were lost from this receiver's view.
                // Log the gap and continue; the next recv will succeed.
                tracing::warn!(missed = n, "session fell behind broadcast — requesting resync");
                request_catalog_resync().await;
            }
            Err(broadcast::error::RecvError::Closed) => {
                // All senders dropped — broadcast channel is done.
                tracing::info!("broadcast channel closed, session exiting");
                break;
            }
        }
    }
}

async fn process_update(_frame: Vec<u8>) {}
async fn request_catalog_resync() {}
```
Capacity sizing for broadcast is more sensitive than for MPSC. The slowest subscriber determines whether lagging occurs. If subscribers have variable processing speeds, size the capacity to accommodate the slowest realistic consumer under load, plus a safety margin.
tokio::sync::watch — Latest Value, Change Notification
watch::channel(initial_value) creates a single-value channel: the sender can update the value at any time, and receivers are notified when it changes. Receivers always see the latest value; intermediate values may be missed if the sender updates faster than the receiver reads.
```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<bool>(false);

    // Clone the receiver for multiple tasks.
    let mut rx2 = rx.clone();
    tokio::spawn(async move {
        // Wait for the value to change.
        rx2.changed().await.unwrap();
        println!("shutdown signal received");
    });

    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tx.send(true).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
}
```
watch::Receiver::borrow() returns the current value without waiting. changed().await waits for the next change and then lets you borrow() the new value. This is the pattern for config reloading: tasks watch for a config change, then read the new config with borrow().
watch is the correct primitive for the Meridian shutdown flag — much better than a broadcast channel. The shutdown event needs to be observed once by each task, and latecomers (tasks that check the flag after shutdown is signalled) need to see true immediately. A broadcast receiver created after the shutdown send would miss the message. A watch receiver always sees the current state.
Choosing Between mpsc, broadcast, and watch
| Pattern | Channel | Use when |
|---|---|---|
| Work queue: one item consumed once | mpsc | 48 sessions each send frames to one aggregator |
| Event broadcast: every subscriber gets every event | broadcast | TLE update delivered to all active sessions |
| State sync: subscribers need the latest value | watch | Shutdown flag, config updates, current orbital state |
| One-shot reply | oneshot | Request-response within an actor message |
The key question: does each message need to be consumed exactly once (mpsc), by every subscriber (broadcast), or is only the latest value relevant (watch)?
watch for Configuration Distribution
A common pattern in the Meridian control plane: runtime configuration loaded at startup and potentially reloaded via a management API. All tasks need to read the current config, and they need to be notified when it changes:
```rust
use tokio::sync::watch;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct ControlPlaneConfig {
    max_frame_size: usize,
    session_timeout_secs: u64,
}

async fn uplink_session(
    satellite_id: u32,
    mut config_rx: watch::Receiver<Arc<ControlPlaneConfig>>,
) {
    loop {
        // Read current config — no lock, no await.
        let config = config_rx.borrow().clone();
        tokio::select! {
            // Process frames using current config.
            _ = tokio::time::sleep(
                tokio::time::Duration::from_secs(config.session_timeout_secs)
            ) => {
                tracing::warn!(satellite_id, "session timeout");
                break;
            }
            // React to config changes mid-session.
            Ok(()) = config_rx.changed() => {
                let new_config = config_rx.borrow().clone();
                tracing::info!(
                    satellite_id,
                    max_frame = new_config.max_frame_size,
                    "config reloaded"
                );
                // Loop continues with new config.
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let initial = Arc::new(ControlPlaneConfig {
        max_frame_size: 65536,
        session_timeout_secs: 600,
    });
    let (config_tx, config_rx) = watch::channel(Arc::clone(&initial));

    // Spawn a few sessions.
    for sat_id in 0..3u32 {
        let rx = config_rx.clone();
        tokio::spawn(uplink_session(sat_id, rx));
    }

    // Simulate a config reload.
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
    config_tx.send(Arc::new(ControlPlaneConfig {
        max_frame_size: 32768,
        session_timeout_secs: 300,
    })).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
}
```
Arc<Config> avoids cloning the full config struct on every borrow(). The Arc::clone is cheap (one atomic increment); the config data is shared read-only across tasks.
Code Examples
TLE Catalog Update Broadcaster
When the orbit data pipeline ingests a new TLE batch, it publishes the update over a broadcast channel. Every active session task receives the update and can refresh its orbital prediction model.
```rust
use tokio::sync::broadcast;
use std::sync::Arc;

#[derive(Clone, Debug)]
struct TleUpdate {
    batch_id: u32,
    records: Arc<Vec<String>>,
}

async fn session_task(
    satellite_id: u32,
    mut tle_rx: broadcast::Receiver<TleUpdate>,
    shutdown_rx: tokio::sync::watch::Receiver<bool>,
) {
    let mut shutdown = shutdown_rx.clone();
    loop {
        tokio::select! {
            result = tle_rx.recv() => {
                match result {
                    Ok(update) => {
                        tracing::info!(
                            satellite_id,
                            batch = update.batch_id,
                            records = update.records.len(),
                            "TLE update applied"
                        );
                    }
                    Err(broadcast::error::RecvError::Lagged(n)) => {
                        tracing::warn!(satellite_id, missed = n, "TLE lag — resyncing");
                    }
                    Err(broadcast::error::RecvError::Closed) => break,
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    break;
                }
            }
        }
    }
    tracing::info!(satellite_id, "session task exiting");
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let (tle_tx, _) = broadcast::channel::<TleUpdate>(32);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);

    // Spawn 4 sessions, each with its own broadcast receiver.
    for sat_id in 0..4u32 {
        let tle_rx = tle_tx.subscribe();
        let sd = shutdown_rx.clone();
        tokio::spawn(session_task(sat_id, tle_rx, sd));
    }

    // Publish a TLE update to all sessions.
    tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
    tle_tx.send(TleUpdate {
        batch_id: 42,
        records: Arc::new(vec!["1 25544U...".to_string(); 100]),
    }).unwrap();

    // Trigger shutdown.
    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
    shutdown_tx.send(true).unwrap();
    tokio::time::sleep(tokio::time::Duration::from_millis(20)).await;
}
```
The combination of broadcast for events and watch for state is idiomatic Tokio. The broadcast channel delivers the catalog update to every session independently; the watch channel distributes the shutdown signal to all tasks simultaneously. The select! in the session loop races the two — whichever fires first wins.
Key Takeaways
- `broadcast::channel(capacity)` distributes every message to every subscriber. Subscribers receive from their own position in a ring buffer. Additional receivers are created via `sender.subscribe()` — receivers created after a message is sent do not receive that message retroactively.
- `RecvError::Lagged(n)` is recoverable. A lagged receiver missed `n` messages but can continue receiving future ones. Whether missing messages is acceptable is application-specific; always handle it explicitly rather than treating it as a fatal error.
- `watch::channel(initial)` is for state distribution: the latest value, not every intermediate value. `borrow()` reads without waiting. `changed().await` waits for the next update. Receivers created after a send see the current value immediately.
- Use `broadcast` when every subscriber must receive every event. Use `watch` when subscribers need the current state and can tolerate missing intermediate updates. Use `mpsc` when each message should be consumed by exactly one task.
- `Arc<Config>` wrapped in a `watch` channel is the idiomatic pattern for distributing read-heavy configuration to many tasks. The watch notify is cheap; the config read is a lock-free `borrow()`.
Lesson 3 — Fan-In Aggregation: Merging Streams from Multiple Satellite Feeds
Module: Foundation — M03: Message Passing Patterns
Position: Lesson 3 of 3
Source: Async Rust — Maxwell Flitton & Caroline Morton, Chapters 3, 8
Context
Lesson 1 covered moving data from many producers to one consumer via MPSC. That is fan-in at its simplest: all producers push to the same channel. But the Meridian aggregator's real requirements are more demanding. The 48 uplink sessions produce at different rates. Archived replay feeds produce at a different priority level than live feeds. A session that goes silent should not block the aggregator from processing the other 47. A priority command frame from a SAFE_MODE event should not wait behind a queue of housekeeping frames.
These requirements call for structured fan-in: merging multiple independent async sources into one stream, with control over priority, fairness, and behaviour when sources are slow or silent. This lesson covers three fan-in patterns — shared-sender MPSC, select!-based merge with priority, and the router actor pattern — and when to use each.
Source: Async Rust, Chapters 3 & 8 (Flitton & Morton)
Core Concepts
Shared-Sender Fan-In: The Simple Case
The simplest fan-in is the one already established in Lesson 1: clone the Sender, give each producer a clone, and let the single Receiver consume them all. Every message enters the same queue; the consumer sees them in arrival order.
```rust
use tokio::sync::mpsc;

async fn uplink_producer(satellite_id: u32, tx: mpsc::Sender<(u32, Vec<u8>)>) {
    for seq in 0u8..5 {
        let frame = vec![satellite_id as u8, seq];
        if tx.send((satellite_id, frame)).await.is_err() {
            break; // Aggregator shut down.
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(u32, Vec<u8>)>(256);

    for sat_id in 0..4u32 {
        tokio::spawn(uplink_producer(sat_id, tx.clone()));
    }
    drop(tx);

    while let Some((sat, frame)) = rx.recv().await {
        println!("sat {sat}: {:?}", frame);
    }
}
```
This is correct and efficient for uniform, same-priority inputs. It has one limitation: arrival order provides no priority control. A SAFE_MODE frame from satellite 7 waits behind whatever housekeeping frames arrived first.
select!-Based Priority Fan-In
When sources have different priorities, select! can implement a priority receive by always checking a high-priority channel before a lower-priority one. Tokio's select! macro randomly selects among ready branches for fairness, but the biased modifier overrides this and evaluates branches in source order:
```rust
use tokio::sync::mpsc;

async fn priority_aggregator(
    mut high: mpsc::Receiver<Vec<u8>>,
    mut low: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        // biased: always check high-priority first.
        // Without biased, both channels are polled in random order —
        // low-priority frames could be dispatched before high-priority ones
        // if both are ready simultaneously.
        tokio::select! {
            biased;
            Some(frame) = high.recv() => {
                println!("HIGH: {} bytes", frame.len());
            }
            Some(frame) = low.recv() => {
                println!("LOW: {} bytes", frame.len());
            }
            else => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (high_tx, high_rx) = mpsc::channel::<Vec<u8>>(64);
    let (low_tx, low_rx) = mpsc::channel::<Vec<u8>>(256);

    // High-priority: SAFE_MODE and emergency commands.
    tokio::spawn(async move {
        high_tx.send(vec![0xFF; 8]).await.unwrap(); // emergency frame
    });

    // Low-priority: housekeeping telemetry.
    tokio::spawn(async move {
        for _ in 0..3 {
            low_tx.send(vec![0x00; 64]).await.unwrap();
        }
    });

    priority_aggregator(high_rx, low_rx).await;
}
```
biased is important here. Without it, if both channels have messages ready, select! randomly picks which to process — a high-priority frame could wait behind three low-priority frames. With biased, the high-priority channel is always drained first. The tradeoff: if the high-priority channel receives messages faster than they are processed, the low-priority channel is starved. For mission-critical applications like SAFE_MODE injection, this is the intended behaviour.
This pattern directly implements what Async Rust Chapter 3 builds when constructing a priority async queue with HIGH_CHANNEL and LOW_CHANNEL — the concept is the same, applied to async channels rather than thread queues.
Tagging Messages with Source Identity
When fan-in merges undifferentiated Vec<u8> frames from multiple sources, the consumer cannot determine which satellite the frame came from. Tag messages at the producer side with an enum or a source identifier:
```rust
use tokio::sync::mpsc;

#[derive(Debug)]
enum FeedKind {
    LiveUplink { satellite_id: u32 },
    ArchivedReplay { mission_id: String },
}

#[derive(Debug)]
struct TaggedFrame {
    source: FeedKind,
    sequence: u64,
    payload: Vec<u8>,
}

async fn live_uplink(sat_id: u32, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..3 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::LiveUplink { satellite_id: sat_id },
            sequence: seq,
            payload: vec![sat_id as u8; 32],
        }).await;
    }
}

async fn replay_feed(mission: String, tx: mpsc::Sender<TaggedFrame>) {
    for seq in 0u64..2 {
        let _ = tx.send(TaggedFrame {
            source: FeedKind::ArchivedReplay { mission_id: mission.clone() },
            sequence: seq,
            payload: vec![0xAA; 128],
        }).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<TaggedFrame>(128);

    for sat_id in 0..3u32 {
        tokio::spawn(live_uplink(sat_id, tx.clone()));
    }
    tokio::spawn(replay_feed("ARTEMIS-IV".to_string(), tx.clone()));
    drop(tx);

    while let Some(frame) = rx.recv().await {
        match &frame.source {
            FeedKind::LiveUplink { satellite_id } => {
                println!("live sat {satellite_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
            FeedKind::ArchivedReplay { mission_id } => {
                println!("replay {mission_id} seq {}: {} bytes",
                    frame.sequence, frame.payload.len());
            }
        }
    }
}
```
Using an enum for source identity is more robust than a raw integer: the compiler enforces that all source types are handled. When a new source type is added, match exhaustiveness forces updates at all handling sites.
The Router Actor Pattern
For more than two or three sources, or when sources are created dynamically (e.g., a new ground station connection comes online mid-session), a router actor is the correct abstraction. The router owns a set of active input channels, polls them all, and forwards to a single output channel. This is the pattern Async Rust Chapter 8 builds as the foundation of its actor system.
```rust
use tokio::sync::mpsc;
use std::collections::HashMap;

#[derive(Debug)]
struct TaggedFrame {
    source_id: u32,
    payload: Vec<u8>,
}

enum RouterMsg {
    /// Register a new uplink feed.
    AddFeed { source_id: u32, feed: mpsc::Receiver<Vec<u8>> },
    /// Remove an uplink feed (session ended).
    RemoveFeed { source_id: u32 },
}

async fn router_actor(
    mut ctrl: mpsc::Receiver<RouterMsg>,
    out: mpsc::Sender<TaggedFrame>,
) {
    // Tokio's mpsc doesn't provide a built-in multi-receiver select,
    // so we use a secondary MPSC where all feeds forward their frames.
    let (internal_tx, mut internal_rx) = mpsc::channel::<TaggedFrame>(512);
    let mut feed_handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();

    loop {
        tokio::select! {
            // Control messages: add or remove feeds.
            Some(msg) = ctrl.recv() => {
                match msg {
                    RouterMsg::AddFeed { source_id, mut feed } => {
                        let fwd_tx = internal_tx.clone();
                        let handle = tokio::spawn(async move {
                            while let Some(payload) = feed.recv().await {
                                if fwd_tx.send(TaggedFrame { source_id, payload }).await.is_err() {
                                    break; // Router shut down.
                                }
                            }
                            tracing::debug!(source_id, "feed task exiting");
                        });
                        feed_handles.insert(source_id, handle);
                    }
                    RouterMsg::RemoveFeed { source_id } => {
                        if let Some(handle) = feed_handles.remove(&source_id) {
                            handle.abort(); // Feed task no longer needed.
                        }
                    }
                }
            }
            // Frames from all registered feeds, already fan-in'ed via internal channel.
            Some(frame) = internal_rx.recv() => {
                if out.send(frame).await.is_err() {
                    break; // Downstream consumer has shut down.
                }
            }
            else => break,
        }
    }
}

#[tokio::main]
async fn main() {
    let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
    let (out_tx, mut out_rx) = mpsc::channel::<TaggedFrame>(256);
    tokio::spawn(router_actor(ctrl_rx, out_tx));

    // Register two satellite feeds dynamically.
    for sat_id in [25544u32, 48274] {
        let (feed_tx, feed_rx) = mpsc::channel::<Vec<u8>>(32);
        ctrl_tx.send(RouterMsg::AddFeed {
            source_id: sat_id,
            feed: feed_rx,
        }).await.unwrap();

        tokio::spawn(async move {
            for i in 0u8..3 {
                feed_tx.send(vec![i; 16]).await.unwrap();
            }
        });
    }
    drop(ctrl_tx);

    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;

    let mut count = 0;
    while let Ok(frame) = tokio::time::timeout(
        tokio::time::Duration::from_millis(20),
        out_rx.recv()
    ).await {
        if let Some(f) = frame {
            println!("sat {}: {} bytes", f.source_id, f.payload.len());
            count += 1;
        } else {
            break;
        }
    }
    println!("total frames: {count}");
}
```
Each registered feed gets a dedicated forwarding task that moves frames to the router's internal channel. The router selects between control messages (add/remove feeds) and forwarded frames. Adding a new satellite source at runtime is a single ctrl_tx.send(RouterMsg::AddFeed {...}) call — no restructuring of the select loop.
Key Takeaways
- Shared-sender MPSC is the simplest fan-in: all producers clone the `Sender`, and the consumer reads from the single `Receiver`. Use it when sources have equal priority and arrival order is acceptable.
- `select!` with `biased` implements priority fan-in: the first branch is always evaluated before the second. Use it for two or three sources with different priority levels. Without `biased`, `select!` randomizes branch selection — a high-priority source is not guaranteed to be drained first when both are ready.
- Tag messages at the source with a typed identifier (`enum` or struct field) rather than relying on arrival order to infer provenance. An enum exhaustiveness check at the consumer forces all source types to be handled explicitly.
- The router actor pattern handles dynamic fan-in: sources can be registered and deregistered at runtime via control messages. Each source gets a dedicated forwarding task that converts its `Receiver` into tagged frames on the internal channel. The router selects between control and data messages.
- Fan-in and fan-out compose: an aggregator can receive from a router (fan-in) and forward to a broadcast channel (fan-out), building a full hub-and-spoke telemetry pipeline from these primitives.
Project — Multi-Source Telemetry Aggregator
Module: Foundation — M03: Message Passing Patterns
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- System Specification
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0047 — Telemetry Aggregation Pipeline
The control plane currently receives telemetry from 48 LEO satellite uplinks and two archived replay feeds simultaneously during mission replay operations. Each source produces frames at independent rates. Emergency command frames from any source must be processed before routine telemetry. Downstream analytics consumers need every frame; a monitoring dashboard needs only the latest pipeline statistics.
Your task is to build the telemetry aggregation pipeline that connects these sources to their consumers. The pipeline must: fan-in all sources into a priority-ordered stream, fan-out to a downstream frame processor and to a monitoring dashboard, apply backpressure so fast sources cannot overwhelm the processor, and shut down cleanly when signalled.
System Specification
Frame Types
```rust
#[derive(Debug, Clone)]
pub enum FramePriority {
    Emergency, // SAFE_MODE, ABORT commands
    Routine,   // Standard telemetry
}

#[derive(Debug, Clone)]
pub struct Frame {
    pub source_id: u32,
    pub source_kind: SourceKind,
    pub priority: FramePriority,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

#[derive(Debug, Clone)]
pub enum SourceKind {
    LiveUplink,
    ArchivedReplay,
}
```
Pipeline Architecture
[Uplink 0..48] ──┐
├─► [Router Actor] ─► [Priority Fan-In] ─► [Frame Processor]
[Replay 0..2] ──┘ │ │
└──────────────────────────────► [Broadcast: all frames]
│
[Dashboard] [Archive]
[watch: shutdown] ──────────────────────────────────────► All tasks
[watch: stats] ◄──────────────────── Frame Processor (updates atomically)
Behavioural Requirements
Fan-in: Frames from live uplinks and archived replays are merged via a router actor that supports dynamic source registration. Emergency frames must be prioritised over routine frames when both are available simultaneously.
Backpressure: The frame processor has a bounded input channel (capacity 64). When the processor is saturated, backpressure propagates up to the priority fan-in, which in turn applies pressure to the router's internal channel. Routine sources are slowed; emergency frames still make progress due to priority ordering.
Fan-out: Every processed frame is sent over a broadcast channel to all downstream consumers. The monitoring dashboard subscribes; an archive writer task subscribes. The dashboard is allowed to lag and handles RecvError::Lagged gracefully.
Stats: The pipeline maintains three AtomicU64 counters: frames_routed, frames_processed, emergency_count. These are exposed via a watch channel as a PipelineStats snapshot, updated by the frame processor after each frame.
Shutdown: A watch<bool> shutdown signal is distributed to all tasks. On signal: (1) stop accepting new frames from sources, (2) drain the priority fan-in channel, (3) close the broadcast channel, (4) all tasks exit within 5 seconds.
Expected Output
A binary that:
- Starts a router actor accepting dynamic source registration
- Registers 4 live uplink sources (each sending 10 frames) and 1 replay source (sending 5 frames)
- 2 of the live uplink frames per source are marked `Emergency`
- Runs a frame processor that logs each frame with its priority and source
- Runs a monitoring task that reads `watch<PipelineStats>` every 50ms and prints stats
- Runs a downstream archive task subscribed to the broadcast channel
- Sends shutdown signal after all sources finish; all tasks exit cleanly
The output should clearly show emergency frames being processed before routine frames from the same batch.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Emergency frames processed before queued routine frames from the same source | Yes — log order |
| 2 | New sources can be registered at runtime via the router control channel | Yes — sources registered mid-run |
| 3 | Frame processor channel capacity is enforced — producers yield when full | Yes — add tokio::time::sleep in processor and verify producers do not drop frames |
| 4 | All downstream consumers receive every processed frame via broadcast | Yes — counts match between processor and archive consumer |
| 5 | Stats watch channel provides latest snapshot without acquiring any lock | Yes — code review: only atomic loads in stats read path |
| 6 | Shutdown drains the fan-in channel before exiting | Yes — no frames lost after shutdown signal |
| 7 | Lagged broadcast receivers log a warning and continue — they do not crash | Yes — introduce a slow archive task and verify Lagged is handled |
Hints
Hint 1 — Priority fan-in with biased select!
Use two channels from the router: one for emergency frames, one for routine. The priority fan-in selects with biased:
```rust
async fn priority_fan_in(
    mut emergency_rx: tokio::sync::mpsc::Receiver<Frame>,
    mut routine_rx: tokio::sync::mpsc::Receiver<Frame>,
    out_tx: tokio::sync::mpsc::Sender<Frame>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            biased;
            Some(f) = emergency_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Some(f) = routine_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
            else => break,
        }
    }
}
```
Hint 2 — Router actor with dynamic registration
The router forwards all sources to two internal channels split by priority. Each registered source gets a forwarding task:
```rust
enum RouterMsg {
    AddSource {
        source_id: u32,
        source_kind: SourceKind,
        feed: tokio::sync::mpsc::Receiver<Frame>,
    },
    RemoveSource { source_id: u32 },
}
```
The forwarding task reads from the feed and sends to the appropriate internal channel based on frame.priority.
Hint 3 — Stats snapshot with watch + atomics
The frame processor updates atomic counters after each frame, then sends a snapshot to the watch channel:
```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

#[derive(Clone, Debug, Default)]
pub struct PipelineStats {
    pub frames_processed: u64,
    pub emergency_count: u64,
}

struct StatsTracker {
    frames_processed: AtomicU64,
    emergency_count: AtomicU64,
    tx: tokio::sync::watch::Sender<PipelineStats>,
}

impl StatsTracker {
    fn record(&self, is_emergency: bool) {
        self.frames_processed.fetch_add(1, Relaxed);
        if is_emergency {
            self.emergency_count.fetch_add(1, Relaxed);
        }
        // Publish a snapshot — receivers always see the latest.
        let _ = self.tx.send(PipelineStats {
            frames_processed: self.frames_processed.load(Relaxed),
            emergency_count: self.emergency_count.load(Relaxed),
        });
    }
}
```
Hint 4 — Broadcast fan-out with lagged handling
```rust
async fn archive_consumer(
    mut rx: tokio::sync::broadcast::Receiver<Frame>,
) {
    let mut archived = 0u64;
    loop {
        match rx.recv().await {
            Ok(frame) => {
                archived += 1;
                tracing::debug!(
                    source = frame.source_id,
                    seq = frame.sequence,
                    "archived"
                );
            }
            Err(tokio::sync::broadcast::error::RecvError::Lagged(n)) => {
                // Archive fell behind — note the gap and continue.
                tracing::warn!(missed = n, "archive lagged");
            }
            Err(tokio::sync::broadcast::error::RecvError::Closed) => {
                tracing::info!(total = archived, "archive consumer done");
                break;
            }
        }
    }
}
```
Reference Implementation
```rust
// This reference implementation is intentionally condensed.
// A production implementation would split into modules.
use tokio::sync::{broadcast, mpsc, watch};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::sync::Arc;
use std::collections::HashMap;
use tokio::time::{sleep, Duration};

#[derive(Debug, Clone)]
pub enum FramePriority { Emergency, Routine }

#[derive(Debug, Clone)]
pub enum SourceKind { LiveUplink, ArchivedReplay }

#[derive(Debug, Clone)]
pub struct Frame {
    pub source_id: u32,
    pub source_kind: SourceKind,
    pub priority: FramePriority,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

#[derive(Clone, Debug, Default)]
pub struct PipelineStats {
    pub frames_processed: u64,
    pub emergency_count: u64,
}

enum RouterMsg {
    AddSource { source_id: u32, feed: mpsc::Receiver<Frame> },
}

async fn router_actor(
    mut ctrl: mpsc::Receiver<RouterMsg>,
    emergency_tx: mpsc::Sender<Frame>,
    routine_tx: mpsc::Sender<Frame>,
) {
    let (internal_tx, mut internal_rx) = mpsc::channel::<Frame>(512);
    let mut handles: HashMap<u32, tokio::task::JoinHandle<()>> = HashMap::new();
    loop {
        tokio::select! {
            Some(msg) = ctrl.recv() => {
                match msg {
                    RouterMsg::AddSource { source_id, mut feed } => {
                        let fwd = internal_tx.clone();
                        let h = tokio::spawn(async move {
                            while let Some(frame) = feed.recv().await {
                                if fwd.send(frame).await.is_err() { break; }
                            }
                        });
                        handles.insert(source_id, h);
                    }
                }
            }
            Some(frame) = internal_rx.recv() => {
                let dest = match frame.priority {
                    FramePriority::Emergency => &emergency_tx,
                    FramePriority::Routine => &routine_tx,
                };
                if dest.send(frame).await.is_err() { break; }
            }
            else => break,
        }
    }
}

async fn priority_fan_in(
    mut emerg_rx: mpsc::Receiver<Frame>,
    mut routine_rx: mpsc::Receiver<Frame>,
    out_tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            biased;
            Some(f) = emerg_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Some(f) = routine_rx.recv() => {
                if out_tx.send(f).await.is_err() { break; }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
            else => break,
        }
    }
}

async fn frame_processor(
    mut rx: mpsc::Receiver<Frame>,
    bcast_tx: broadcast::Sender<Frame>,
    stats_tx: watch::Sender<PipelineStats>,
    processed: Arc<AtomicU64>,
    emergency: Arc<AtomicU64>,
) {
    while let Some(frame) = rx.recv().await {
        let is_emerg = matches!(frame.priority, FramePriority::Emergency);
        tracing::info!(
            source = frame.source_id,
            seq = frame.sequence,
            priority = if is_emerg { "EMERGENCY" } else { "routine" },
            "processed"
        );
        processed.fetch_add(1, Relaxed);
        if is_emerg { emergency.fetch_add(1, Relaxed); }
        let _ = stats_tx.send(PipelineStats {
            frames_processed: processed.load(Relaxed),
            emergency_count: emergency.load(Relaxed),
        });
        let _ = bcast_tx.send(frame);
    }
}

async fn archive_consumer(mut rx: broadcast::Receiver<Frame>) {
    let mut count = 0u64;
    loop {
        match rx.recv().await {
            Ok(_) => count += 1,
            Err(broadcast::error::RecvError::Lagged(n)) => {
                tracing::warn!(missed = n, "archive lagged");
            }
            Err(broadcast::error::RecvError::Closed) => {
                tracing::info!(total = count, "archive done");
                break;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let (shutdown_tx, shutdown_rx) = watch::channel(false);
    let (stats_tx, mut stats_rx) = watch::channel(PipelineStats::default());
    let (bcast_tx, _) = broadcast::channel::<Frame>(128);
    let (ctrl_tx, ctrl_rx) = mpsc::channel::<RouterMsg>(8);
    let (emerg_tx, emerg_rx) = mpsc::channel::<Frame>(64);
    let (routine_tx, routine_rx) = mpsc::channel::<Frame>(256);
    let (proc_tx, proc_rx) = mpsc::channel::<Frame>(64);
    let processed = Arc::new(AtomicU64::new(0));
    let emergency = Arc::new(AtomicU64::new(0));

    // Start pipeline tasks.
    tokio::spawn(router_actor(ctrl_rx, emerg_tx, routine_tx));
    tokio::spawn(priority_fan_in(emerg_rx, routine_rx, proc_tx, shutdown_rx.clone()));
    tokio::spawn(frame_processor(
        proc_rx,
        bcast_tx.clone(),
        stats_tx,
        Arc::clone(&processed),
        Arc::clone(&emergency),
    ));
    tokio::spawn(archive_consumer(bcast_tx.subscribe()));

    // Register 4 live uplink sources.
    for sat_id in 0..4u32 {
        let (feed_tx, feed_rx) = mpsc::channel::<Frame>(32);
        ctrl_tx.send(RouterMsg::AddSource { source_id: sat_id, feed: feed_rx }).await.unwrap();
        tokio::spawn(async move {
            for seq in 0u64..10 {
                let priority = if seq < 2 {
                    FramePriority::Emergency
                } else {
                    FramePriority::Routine
                };
                feed_tx.send(Frame {
                    source_id: sat_id,
                    source_kind: SourceKind::LiveUplink,
                    priority,
                    sequence: seq,
                    payload: vec![sat_id as u8; 32],
                }).await.unwrap();
                sleep(Duration::from_millis(5)).await;
            }
        });
    }

    // Stats monitor.
    tokio::spawn(async move {
        for _ in 0..4 {
            sleep(Duration::from_millis(50)).await;
            stats_rx.changed().await.ok();
            let s = stats_rx.borrow().clone();
            println!("stats: processed={} emergency={}", s.frames_processed, s.emergency_count);
        }
    });

    sleep(Duration::from_millis(300)).await;
    println!("sending shutdown");
    shutdown_tx.send(true).unwrap();
    sleep(Duration::from_millis(100)).await;
    println!(
        "final: processed={} emergency={}",
        processed.load(Relaxed),
        emergency.load(Relaxed)
    );
}
```
Reflection
This project assembles the full message-passing toolkit from Module 3. The router actor provides dynamic fan-in with independent source lifecycle management. The priority fan-in ensures emergency frames are never delayed by routine traffic. The broadcast channel distributes every processed frame to all downstream consumers. The watch channel distributes state — shutdown signal and pipeline stats — without requiring consumers to hold any lock.
The pattern here — router → priority queue → processor → broadcast — recurs throughout Meridian's data pipeline architecture. In Module 4 (Network Programming), the router actor gains TCP listener integration, turning it into a full ground station connection broker.
Module 04 — Network Programming
Track: Foundation — Mission Control Platform
Position: Module 4 of 6
Source material: Tokio tutorial I/O and Framing chapters; reqwest documentation; tokio::net API docs
Quiz pass threshold: 70% on all three lessons to unlock the project
Note on source book: Network Programming with Rust (Chanda, 2018) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. Lesson content is grounded in the current Tokio tutorial and API documentation rather than that book.
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Ground Station Network Client
- Prerequisites
- What Comes Next
Mission Context
The Meridian control plane's telemetry pipeline now has a complete message-passing architecture (Module 3). What it still lacks is the network layer: the actual TCP connections from ground stations that feed the pipeline. This module builds that layer — connecting the abstract pipeline to the physical network.
The control plane operates three distinct network protocols simultaneously: persistent TCP sessions with ground stations (framed, long-lived, must reconnect on failure), UDP datagrams from SDA radar and optical sensors (high-frequency, latency-sensitive, loss-tolerant), and outbound HTTP calls to the TLE catalog API and mission operations endpoints (request-response, with retry logic).
What You Will Learn
By the end of this module you will be able to:
- Build async TCP servers with `tokio::net::TcpListener`, spawn per-connection tasks, handle EOF correctly, and shut down the accept loop cleanly via a `watch` channel shutdown signal
- Use `AsyncReadExt::read_exact` for length-prefix framing, split sockets with `TcpStream::split()` and `into_split()` for concurrent read/write, and wrap writers in `BufWriter` to reduce syscall overhead
- Add per-session timeouts to detect silent connections (antenna tracking failures, network blackouts) without leaving ghost sessions open
- Bind and use `tokio::net::UdpSocket` in both connected and unconnected modes, understand why UDP receive buffers must be sized to the maximum datagram, and apply `try_send` rather than blocking in high-frequency sensor pipelines
- Build a production `reqwest::Client` with appropriate timeout configuration, share it via `Clone` across async tasks, use `error_for_status()` correctly, and implement exponential backoff retry logic that distinguishes retryable server errors from non-retryable client errors
Lessons
Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown
Covers TcpListener::bind and the accept loop, AsyncReadExt/AsyncWriteExt extension traits, read_exact for framing, EOF handling, TcpStream::split() vs into_split(), BufWriter for write batching, read timeouts, and graceful shutdown of both the accept loop and individual connections.
Key question this lesson answers: How do you build a TCP server that handles many concurrent connections correctly — reading frames, handling EOF, splitting for bidirectional I/O, and shutting down cleanly?
→ lesson-01-tcp-servers.md / lesson-01-quiz.toml
Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data
Covers UdpSocket::bind, recv_from/send_to semantics, connected vs unconnected mode, concurrent send/receive via Arc<UdpSocket>, buffer sizing and IP fragmentation, OS socket buffer tuning with socket2, and the decision between UDP and TCP for high-frequency sensor pipelines.
Key question this lesson answers: When does UDP's lack of ordering and reliability become an advantage, and how do you structure a receiver that does not block on a slow downstream consumer?
→ lesson-02-udp.md / lesson-02-quiz.toml
Lesson 3 — HTTP Clients with reqwest: Async REST Calls
Covers reqwest::Client construction and sharing, ClientBuilder timeout configuration, error_for_status(), .json() for serialization/deserialization, retry logic with exponential backoff and jitter, status-code-based retry decisions, and multiple clients for services with different SLOs.
Key question this lesson answers: How do you build a robust HTTP client that handles transient failures without hammering a rate-limited API, and correctly distinguishes retryable errors from permanent ones?
→ lesson-03-http-clients.md / lesson-03-quiz.toml
Capstone Project — Ground Station Network Client
Build the full ground station client: connects to a TCP endpoint using the length-prefix framing protocol, automatically reconnects on failure with exponential backoff, runs a background TLE refresh via HTTP with retry logic, forwards received frames to the downstream aggregator pipeline via try_send, publishes session state via a watch channel, and shuts down cleanly including a GOODBYE frame to the peer.
Acceptance is against 7 verifiable criteria including automatic reconnection, bounded backoff, 5-minute failure timeout, TLE retry, non-blocking frame forwarding, mid-frame shutdown safety, and state machine correctness.
→ project-gs-client.md
Prerequisites
Modules 1–3 must be complete. Module 1 established the async task model and tokio::select! — both used extensively in connection handlers. Module 3 established the message-passing pipeline that network frames feed into. Understanding mpsc::Sender and try_send from Module 3 is prerequisite to the UDP and TCP lessons' discussion of non-blocking frame forwarding.
What Comes Next
Module 5 — Data-Oriented Design in Rust shifts from I/O to computation: how to lay out structs for CPU cache efficiency, when to use struct-of-arrays vs array-of-structs, and arena allocation for high-throughput frame processing. The telemetry frames arriving via the TCP and UDP clients from this module are processed in bulk in Module 5.
Lesson 1 — TCP Servers with tokio::net: Listeners, Connection Handling, and Graceful Shutdown
Module: Foundation — M04: Network Programming
Position: Lesson 1 of 3
Source: Tokio tutorial — I/O and Framing chapters (tokio.rs/tokio/tutorial)
Source note: Network Programming with Rust (Chanda) uses pre-async/await Tokio 0.1 APIs that are incompatible with current Tokio 1.x. This lesson is grounded in the current Tokio tutorial and Tokio 1.x API documentation.
Context
Every uplink session in the Meridian control plane begins with a TCP connection from a ground station. The Module 1 broker project sketched this accept loop in broad strokes. This lesson provides the complete model: how TcpListener binds and accepts connections, how to split a socket for concurrent read and write, how AsyncReadExt and AsyncWriteExt handle framed protocols, how a connection handler exits cleanly on EOF or error, and how the accept loop itself shuts down gracefully without leaking tasks.
The patterns here are not specific to Meridian. Every TCP server in Rust's async ecosystem — from a Redis clone to a satellite control plane — uses the same building blocks. Understanding them at the structural level means you can build, debug, and extend any such system.
Core Concepts
TcpListener — Binding and Accepting
tokio::net::TcpListener::bind(addr) binds the socket and returns a TcpListener. listener.accept().await waits for the next incoming connection and returns a (TcpStream, SocketAddr) pair. The accept call is async — while waiting, the executor can run other tasks.
```rust
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:7777").await?;
    loop {
        let (socket, addr) = listener.accept().await?;
        println!("connection from {addr}");
        // Each connection gets its own task.
        tokio::spawn(async move {
            handle_connection(socket).await;
        });
    }
}

async fn handle_connection(_socket: tokio::net::TcpStream) {
    // ... read frames, process, respond
}
```
The accept loop spawns a new task per connection and immediately loops back to accept the next one. The connection handler runs concurrently with all other handlers and with the accept loop itself. This is the fundamental async TCP server structure.
One critical detail: if listener.accept() returns an error, it does not always mean the listener is broken. ECONNABORTED (a connection died while queued in the backlog) and resource-exhaustion errors such as EMFILE are transient: log them and keep accepting. An unrecoverable error (e.g., the listener fd was closed) should terminate the loop. A simple approach is to log the error and continue; a production-grade implementation adds an exponential backoff on repeated errors so a persistent failure does not become a hot loop.
AsyncRead, AsyncWrite, and Their Extension Traits
tokio::net::TcpStream implements both AsyncRead and AsyncWrite, but you almost never call their methods directly. Instead you use the extension traits AsyncReadExt and AsyncWriteExt (from tokio::io), which provide ergonomic higher-level methods:
| Method | Description |
|---|---|
| `read(&mut buf)` | Read up to `buf.len()` bytes; returns 0 on EOF |
| `read_exact(&mut buf)` | Read exactly `buf.len()` bytes; errors on EOF |
| `read_u32()`, `read_u64()`, etc. | Read a big-endian integer |
| `write_all(&buf)` | Write all bytes in `buf` |
| `write_u32(n)`, etc. | Write a big-endian integer |
read_exact is the right primitive for fixed-size framing (like Meridian's 4-byte length prefix). It guarantees the buffer is fully populated before returning, handling the case where the underlying read returns fewer bytes than requested.
EOF handling: read() returning Ok(0) means the remote has closed the write half of the connection. Any subsequent read() will also return Ok(0). When you see this, exit the read loop — continuing to call read() on a closed stream creates a 100% CPU spin loop.
```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

async fn read_frame(stream: &mut TcpStream) -> anyhow::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    // read_exact returns Err(UnexpectedEof) if the connection closes mid-header.
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => {
            // Clean EOF at frame boundary — connection closed normally.
            return Ok(None);
        }
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len} bytes");
    }
    let mut payload = vec![0u8; len];
    stream.read_exact(&mut payload).await?;
    Ok(Some(payload))
}
```
Splitting a Socket: TcpStream::split and into_split
A TcpStream implements both AsyncRead and AsyncWrite, but Rust's borrow rules prevent handing &mut stream to two concurrent operations at the same time. To read and write concurrently — for example, to handle a bidirectional protocol or to send heartbeat responses while reading frames — the socket must be split.
TcpStream::split() splits by reference. Both halves borrow the stream, so they must stay on the same task, but they can be used independently within a single select! or as a sequential pair. Zero cost — no Arc, no lock.
TcpStream::into_split() splits by value into OwnedReadHalf and OwnedWriteHalf. Each half can be sent to a different task. Internally this wraps the stream in an Arc — slightly more overhead than the reference split, but needed when the read and write tasks must be truly independent. (The generic tokio::io::split() works on any AsyncRead + AsyncWrite type but adds a lock; for TcpStream, prefer into_split().)
```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

async fn bidirectional_handler(stream: TcpStream) -> anyhow::Result<()> {
    // into_split: value split — each half can move to separate tasks.
    let (mut reader, mut writer) = stream.into_split();

    // Write task: sends periodic heartbeats.
    let write_task = tokio::spawn(async move {
        loop {
            tokio::time::sleep(tokio::time::Duration::from_secs(30)).await;
            if writer.write_all(b"HEARTBEAT\n").await.is_err() {
                break;
            }
        }
    });

    // Read task: processes incoming frames.
    let mut buf = vec![0u8; 4096];
    loop {
        let n = reader.read(&mut buf).await?;
        if n == 0 { break; } // EOF
        tracing::debug!(bytes = n, "frame received");
    }
    write_task.abort();
    Ok(())
}
```
Use TcpStream::split() (reference) when both read and write stay in one task. Use TcpStream::into_split() (value) when they need to move to separate tasks.
BufWriter — Reducing Syscalls on the Write Path
Each write_all call is a syscall. For a protocol that sends many small writes (header bytes, then payload bytes), the overhead accumulates. Wrapping the write half in tokio::io::BufWriter buffers writes and flushes them in larger batches:
```rust
use tokio::io::{AsyncWriteExt, BufWriter};
use tokio::net::TcpStream;

async fn write_framed(stream: TcpStream, payload: &[u8]) -> anyhow::Result<()> {
    // BufWriter with 8KB internal buffer — flushes when full or on explicit flush().
    let mut writer = BufWriter::new(stream);

    // These two writes go to the internal buffer, not to the socket.
    let len = payload.len() as u32;
    writer.write_all(&len.to_be_bytes()).await?;
    writer.write_all(payload).await?;

    // flush() pushes the buffered bytes to the socket in one syscall.
    writer.flush().await?;
    Ok(())
}
```
Always call flush() after writing a complete logical unit (a frame, a response). If you return from the handler without flushing, buffered data is silently dropped when the BufWriter drops.
Graceful Shutdown of the Accept Loop
A simple loop { listener.accept().await? } has no shutdown path. The pattern from Lesson 3 of Module 1 applies here: race the accept against a shutdown signal with select!:
```rust
use tokio::net::TcpListener;
use tokio::sync::watch;

async fn accept_loop(
    listener: TcpListener,
    mut shutdown: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            accept = listener.accept() => {
                match accept {
                    Ok((socket, addr)) => {
                        tracing::info!(%addr, "connection accepted");
                        let sd = shutdown.clone();
                        tokio::spawn(async move {
                            connection_handler(socket, sd).await;
                        });
                    }
                    Err(e) => {
                        tracing::warn!("accept error: {e}");
                        // Continue — transient errors are normal.
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    tracing::info!("accept loop shutting down");
                    break;
                }
            }
        }
    }
}

async fn connection_handler(
    _socket: tokio::net::TcpStream,
    _shutdown: watch::Receiver<bool>,
) {
    // Read frames; check shutdown between reads.
}
```
Pass the watch::Receiver into each connection handler so that individual connections can also respond to the shutdown signal — stopping mid-read cleanly rather than being forcibly dropped.
Code Examples
Production Ground Station TCP Server
A complete TCP server for a Meridian ground station connection. Reads length-prefixed frames, forwards them to the telemetry aggregator from Module 3, and shuts down cleanly.
```rust
use anyhow::Result;
use tokio::{
    io::{AsyncReadExt, AsyncWriteExt},
    net::{TcpListener, TcpStream},
    sync::{mpsc, watch},
    time::{timeout, Duration},
};
use tracing::{info, warn};

#[derive(Debug)]
struct TelemetryFrame {
    station_id: String,
    payload: Vec<u8>,
}

async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len}");
    }
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(Some(buf))
}

async fn handle_connection(
    mut stream: TcpStream,
    station_id: String,
    frame_tx: mpsc::Sender<TelemetryFrame>,
    mut shutdown: watch::Receiver<bool>,
) {
    info!(station = %station_id, "session started");
    loop {
        tokio::select! {
            // Bias toward reading so a ready frame is delivered before shutdown is checked.
            biased;
            frame = timeout(Duration::from_secs(60), read_frame(&mut stream)) => {
                match frame {
                    // Session timeout — ground station went silent.
                    Err(_elapsed) => {
                        warn!(station = %station_id, "session timeout");
                        break;
                    }
                    Ok(Ok(Some(payload))) => {
                        if frame_tx.send(TelemetryFrame {
                            station_id: station_id.clone(),
                            payload,
                        }).await.is_err() {
                            break; // Aggregator shut down.
                        }
                    }
                    Ok(Ok(None)) => {
                        info!(station = %station_id, "connection closed by peer");
                        break;
                    }
                    Ok(Err(e)) => {
                        warn!(station = %station_id, "read error: {e}");
                        break;
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    info!(station = %station_id, "shutdown signal — closing session");
                    break;
                }
            }
        }
    }
    // Send a clean close to the peer.
    let _ = stream.shutdown().await;
    info!(station = %station_id, "session ended");
}

pub async fn run_tcp_server(
    bind_addr: &str,
    frame_tx: mpsc::Sender<TelemetryFrame>,
    shutdown: watch::Receiver<bool>,
) -> Result<()> {
    let listener = TcpListener::bind(bind_addr).await?;
    info!("ground station server listening on {bind_addr}");
    let mut conn_id = 0usize;
    let mut sd = shutdown.clone();
    loop {
        tokio::select! {
            accept = listener.accept() => {
                let (socket, addr) = accept?;
                conn_id += 1;
                let station_id = format!("gs-{conn_id}@{addr}");
                tokio::spawn(handle_connection(
                    socket,
                    station_id,
                    frame_tx.clone(),
                    shutdown.clone(),
                ));
            }
            Ok(()) = sd.changed() => {
                if *sd.borrow() { break; }
            }
        }
    }
    info!("accept loop exited");
    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();
    let (frame_tx, mut frame_rx) = mpsc::channel::<TelemetryFrame>(256);
    let (shutdown_tx, shutdown_rx) = watch::channel(false);

    // Frame consumer.
    tokio::spawn(async move {
        while let Some(frame) = frame_rx.recv().await {
            info!(station = %frame.station_id, bytes = frame.payload.len(), "frame received");
        }
    });

    // Shutdown after 2 seconds for demo purposes.
    let sd = shutdown_tx;
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(2)).await;
        let _ = sd.send(true);
    });

    run_tcp_server("0.0.0.0:7777", frame_tx, shutdown_rx).await
}
```
Several production decisions are embedded here: the timeout around read_frame handles silent connections (antenna loss, network blackout) without leaving ghost sessions open. stream.shutdown() sends a TCP FIN to the peer on clean exit. The biased select! polls the read branch first, so a frame that is already ready is delivered before the shutdown branch is checked; note, however, that a read still in progress when shutdown fires is cancelled, because select! drops the losing future.
Key Takeaways
- `TcpListener::bind().await` binds the socket; `listener.accept().await` yields a `(TcpStream, SocketAddr)`. Spawn a task per connection and loop back immediately — the accept loop should never be blocked by connection handling.
- `read()` returning `Ok(0)` is EOF — the remote closed its write half. Continuing to call `read()` after EOF creates a spin loop. Always exit the read loop on `Ok(0)`.
- `read_exact` is the correct primitive for fixed-size framing. It handles short reads internally and returns `UnexpectedEof` if the connection closes before the buffer is filled.
- Use `TcpStream::split()` for same-task read/write splitting (zero cost). Use `TcpStream::into_split()` when the read and write halves must move to separate tasks.
- `BufWriter` batches small writes. Always call `flush()` after writing a complete logical unit — unflushed data is silently dropped when the writer drops.
- Add a `timeout` to reads in long-lived connections. Ground stations go silent without warning. A 60-second read timeout detects ghost sessions that would otherwise hold open resources indefinitely.
Lesson 2 — UDP and Datagram Protocols: Low-Latency Sensor Data
Module: Foundation — M04: Network Programming
Position: Lesson 2 of 3
Source: Synthesized from training knowledge and tokio::net::UdpSocket documentation
Source note: This lesson synthesizes from current tokio::net::UdpSocket API documentation and training knowledge. The following concepts would benefit from verification against current documentation if the API has changed: splitting on UdpSocket, recv_from/send_to semantics, and connected-vs-unconnected modes.
Context
The Meridian Space Domain Awareness network includes optical sensors and radar installations that report raw detection events at high frequency with strict latency requirements. A radar return needs to reach the conjunction analysis pipeline in under 50ms. At that latency budget, TCP's per-packet acknowledgment and retransmission overhead is a liability, not a feature. When the occasional dropped packet is acceptable — or when the application layer manages its own loss detection — UDP is the right transport.
UDP is a datagram protocol: each send and recv corresponds to exactly one discrete packet. There are no streams, no connection establishment, no ordering guarantees, and no retransmission. What you get is low overhead, minimal kernel buffering, and latency that is bounded only by the network, not by protocol machinery.
This lesson covers tokio::net::UdpSocket: binding, sending, receiving, splitting for concurrent send/receive, and the design decisions around UDP in a high-frequency sensor pipeline.
Core Concepts
UDP Socket Basics
UdpSocket::bind(addr) creates a UDP socket bound to a local address. Unlike TCP, there is no accept loop and no connection concept. A single bound socket can send to any address and receive from any address:
```rust
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Bind to receive on all interfaces, port 9090.
    let socket = UdpSocket::bind("0.0.0.0:9090").await?;
    let mut buf = [0u8; 1024];
    loop {
        // recv_from returns the number of bytes and the sender's address.
        let (n, addr) = socket.recv_from(&mut buf).await?;
        println!("received {n} bytes from {addr}: {:?}", &buf[..n]);
        // Echo back.
        socket.send_to(&buf[..n], addr).await?;
    }
}
```
recv_from waits for the next datagram. If the incoming datagram is larger than buf, the excess bytes are silently discarded — there is no partial read concept in UDP. Size your buffer to the maximum expected datagram, not the average.
Connected Mode vs. Unconnected Mode
An unconnected UDP socket can communicate with any remote address. A connected UDP socket is associated with one specific remote address via socket.connect(addr) — this is not a TCP handshake, just a filter on the local OS socket:
```rust
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0").await?; // OS assigns port
    // "Connect" to the sensor — enables send/recv instead of send_to/recv_from.
    // Datagrams from other addresses are filtered out.
    socket.connect("192.168.1.100:5500").await?;
    socket.send(b"POLL").await?;
    let mut buf = [0u8; 256];
    let n = socket.recv(&mut buf).await?;
    println!("sensor response: {:?}", &buf[..n]);
    Ok(())
}
```
After connect(), use send/recv instead of send_to/recv_from. The OS filters datagrams to only those from the connected address, which is useful for point-to-point sensor polling. For a server receiving from many sensors, use the unconnected mode with recv_from.
Splitting for Concurrent Send/Receive
A receive loop and a periodic sender usually need to run concurrently on the same socket. Unlike `TcpStream`, `UdpSocket` has no split API — and it does not need one: `send_to` and `recv_from` take `&self`, so wrapping the socket in an `Arc` and cloning the handle lets separate tasks send and receive on the same socket:
```rust
use std::sync::Arc;
use tokio::net::UdpSocket;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let socket = Arc::new(UdpSocket::bind("0.0.0.0:9090").await?);
    // For UdpSocket, Arc-sharing is the idiomatic split pattern
    // because both send_to and recv_from take &self.
    let recv_socket = Arc::clone(&socket);
    let send_socket = Arc::clone(&socket);

    let recv_task = tokio::spawn(async move {
        let mut buf = [0u8; 1024];
        loop {
            let (n, addr) = recv_socket.recv_from(&mut buf).await.unwrap();
            println!("recv {n} bytes from {addr}");
        }
    });

    let send_task = tokio::spawn(async move {
        // Periodic heartbeat to a known sensor address.
        loop {
            tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
            send_socket.send_to(b"HEARTBEAT", "192.168.1.100:5500")
                .await.unwrap();
        }
    });

    let _ = tokio::join!(recv_task, send_task);
    Ok(())
}
```
UdpSocket's send_to and recv_from take &self (shared reference), so wrapping in Arc lets multiple tasks share the same socket without splitting. This differs from TcpStream where read and write require &mut self.
Buffer Sizing and Packet Loss
UDP datagrams have a maximum payload of 65,507 bytes over IPv4 (65,535 minus 20 bytes of IP header and 8 bytes of UDP header), but practical limits are much lower. A datagram that exceeds the network MTU (typically 1500 bytes on Ethernet) is fragmented at the IP layer. If any fragment is lost, the entire datagram is discarded. For high-frequency sensor data, keep individual datagrams under 1472 bytes (1500 MTU - 20 IP header - 8 UDP header) to avoid fragmentation.
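If a sensor payload can exceed that budget, one option is to chunk it at the application layer before sending. A minimal std-only sketch of the idea — the 4-byte sequence header and the `chunk_payload` helper are illustrative assumptions, not part of any Meridian wire format:

```rust
/// Split a payload into datagrams that stay under the Ethernet MTU.
/// Each chunk is prefixed with a 4-byte big-endian sequence number
/// (hypothetical header — adapt to your actual wire format).
const MAX_DATAGRAM: usize = 1472; // 1500 MTU - 20 IP header - 8 UDP header
const HEADER: usize = 4;

fn chunk_payload(payload: &[u8]) -> Vec<Vec<u8>> {
    payload
        .chunks(MAX_DATAGRAM - HEADER)
        .enumerate()
        .map(|(seq, chunk)| {
            let mut dgram = Vec::with_capacity(HEADER + chunk.len());
            dgram.extend_from_slice(&(seq as u32).to_be_bytes());
            dgram.extend_from_slice(chunk);
            dgram
        })
        .collect()
}

fn main() {
    // A 4000-byte payload splits into three datagrams, each under the MTU.
    let dgrams = chunk_payload(&vec![0u8; 4000]);
    assert_eq!(dgrams.len(), 3);
    assert!(dgrams.iter().all(|d| d.len() <= MAX_DATAGRAM));
    println!("{} datagrams", dgrams.len());
}
```

The receiver reassembles by sequence number — and because any lost chunk ruins the whole payload, chunking only makes sense for payloads where partial loss is detectable and tolerable.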
Buffer the receive socket at the OS level with SO_RCVBUF if sensor bursts arrive faster than the application can drain them. This requires socket2 or nix crate access to set socket options before wrapping in tokio::net::UdpSocket:
```rust
use socket2::{Socket, Domain, Type};
use std::net::SocketAddr;
use tokio::net::UdpSocket;

async fn bind_with_large_buffer(addr: &str) -> anyhow::Result<UdpSocket> {
    let addr: SocketAddr = addr.parse()?;
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, None)?;
    socket.set_reuse_address(true)?;
    // 4MB receive buffer to absorb radar bursts.
    socket.set_recv_buffer_size(4 * 1024 * 1024)?;
    socket.bind(&addr.into())?;
    socket.set_nonblocking(true)?;
    Ok(UdpSocket::from_std(socket.into())?)
}
```
When to Choose UDP over TCP
| Situation | Preferred |
|---|---|
| Radar/optical detection events, < 50ms latency budget | UDP |
| Telemetry frames requiring ordered delivery and reliability | TCP |
| Configuration commands — must not be lost | TCP |
| Periodic status heartbeats where loss is acceptable | UDP |
| Bulk TLE catalog transfer | TCP |
| High-frequency position updates where only latest matters | UDP |
The core tradeoff: TCP adds ordering, reliability, and flow control at the cost of latency and per-connection overhead. UDP provides a raw datagram channel — if reliability matters, implement it yourself (sequence numbers, ACKs, retransmission) at the application layer.
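As a concrete illustration of application-layer loss detection, the sketch below tracks sequence numbers and counts gaps. It is a deliberate simplification — `LossDetector` is a hypothetical name, and it assumes monotonically increasing sequence numbers, ignoring wraparound and reordering:

```rust
/// Tracks datagram sequence numbers and reports gaps (lost packets).
/// Sketch only: assumes a monotonically increasing u32 sequence number
/// in each datagram; wraparound and reordering handling are omitted.
struct LossDetector {
    next_expected: u32,
    lost: u64,
}

impl LossDetector {
    fn new() -> Self {
        Self { next_expected: 0, lost: 0 }
    }

    /// Record an arriving sequence number; returns how many packets
    /// were skipped since the last one (0 for in-order delivery).
    fn observe(&mut self, seq: u32) -> u32 {
        let gap = seq.saturating_sub(self.next_expected);
        self.next_expected = seq + 1;
        self.lost += gap as u64;
        gap
    }
}

fn main() {
    let mut det = LossDetector::new();
    assert_eq!(det.observe(0), 0); // in order
    assert_eq!(det.observe(1), 0); // in order
    assert_eq!(det.observe(4), 2); // sequences 2 and 3 never arrived
    assert_eq!(det.lost, 2);
    println!("lost so far: {}", det.lost);
}
```

Whether you then request retransmission or just record a loss metric depends on whether the data is still useful by the time a retransmit could arrive.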
Code Examples
SDA Radar Sensor Receiver
The Meridian SDA network has radar stations that broadcast detection events as UDP datagrams. The receiver processes them and forwards to the conjunction analysis pipeline. Packet loss is tolerable — a delayed radar return is worse than a missed one, and the next sweep arrives in 250ms anyway.
```rust
use std::net::SocketAddr;
use std::sync::Arc;
use tokio::net::UdpSocket;
use tokio::sync::mpsc;
use tokio::time::Duration;

#[derive(Debug)]
struct RadarDetection {
    sensor_id: u32,
    azimuth_deg: f32,
    elevation_deg: f32,
    range_km: f32,
    timestamp_ms: u64,
}

fn parse_detection(buf: &[u8], addr: SocketAddr) -> Option<RadarDetection> {
    // Wire format: 4-byte sensor_id | 4-byte azimuth (f32 BE) |
    // 4-byte elevation (f32 BE) | 4-byte range (f32 BE) |
    // 8-byte timestamp (u64 BE)
    if buf.len() < 24 {
        return None; // Malformed datagram — discard silently.
    }
    let sensor_id = u32::from_be_bytes(buf[0..4].try_into().ok()?);
    let azimuth = f32::from_be_bytes(buf[4..8].try_into().ok()?);
    let elevation = f32::from_be_bytes(buf[8..12].try_into().ok()?);
    let range = f32::from_be_bytes(buf[12..16].try_into().ok()?);
    let timestamp = u64::from_be_bytes(buf[16..24].try_into().ok()?);
    tracing::debug!(%addr, sensor_id, "detection received");
    Some(RadarDetection {
        sensor_id,
        azimuth_deg: azimuth,
        elevation_deg: elevation,
        range_km: range,
        timestamp_ms: timestamp,
    })
}

async fn radar_receiver(
    bind_addr: &str,
    tx: mpsc::Sender<RadarDetection>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) -> anyhow::Result<()> {
    let socket = Arc::new(UdpSocket::bind(bind_addr).await?);
    tracing::info!("radar receiver listening on {bind_addr}");
    let mut buf = [0u8; 1472]; // Stay under MTU to avoid fragmentation.
    loop {
        tokio::select! {
            biased;
            recv = socket.recv_from(&mut buf) => {
                match recv {
                    Ok((n, addr)) => {
                        if let Some(detection) = parse_detection(&buf[..n], addr) {
                            // Non-blocking — drop if pipeline is full rather than
                            // blocking the receive loop. A queued radar sweep is
                            // useless by the time it clears the backlog.
                            if tx.try_send(detection).is_err() {
                                tracing::warn!("detection pipeline full — datagram dropped");
                            }
                        }
                    }
                    Err(e) => {
                        tracing::warn!("recv error: {e}");
                        // UDP recv errors are typically transient — continue.
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    tracing::info!("radar receiver shutting down");
                    break;
                }
            }
        }
    }
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    tracing_subscriber::fmt::init();
    let (tx, mut rx) = mpsc::channel::<RadarDetection>(512);
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);
    tokio::spawn(radar_receiver("0.0.0.0:9090", tx, shutdown_rx));

    // Consumer: conjunction analysis pipeline.
    tokio::spawn(async move {
        while let Some(det) = rx.recv().await {
            tracing::info!(
                sensor = det.sensor_id,
                az = det.azimuth_deg,
                el = det.elevation_deg,
                range = det.range_km,
                "detection processed"
            );
        }
    });

    // Demo: shut down after 5 seconds.
    tokio::time::sleep(Duration::from_secs(5)).await;
    shutdown_tx.send(true)?;
    tokio::time::sleep(Duration::from_millis(100)).await;
    Ok(())
}
```
try_send instead of send().await is deliberate here. If the conjunction pipeline is saturated, blocking the radar receive loop means subsequent datagrams pile up in the OS socket buffer and eventually overflow it too. Dropping one detection and keeping the receive loop running is the correct behaviour for high-frequency sensor data where recency matters more than completeness.
Key Takeaways
- UDP is a datagram protocol — each `send`/`recv` is one discrete packet with no ordering, reliability, or congestion control. Use it when latency matters more than reliability, or when the application layer manages loss detection.
- `recv_from` returns the number of bytes received and the sender's address. If the datagram is larger than the buffer, excess bytes are silently discarded. Size receive buffers to the maximum expected datagram, not the average.
- `connect()` on a UDP socket is not a handshake — it sets a default remote address and filters incoming datagrams from other addresses. Use connected mode for point-to-point polling; use unconnected mode for servers receiving from many sources.
- `UdpSocket`'s `send_to` and `recv_from` take `&self`. Wrapping in `Arc` lets multiple tasks share one socket without a formal split — unlike `TcpStream`, which requires `into_split()` or `split()` for concurrent access.
- Keep datagrams under 1472 bytes on Ethernet networks to avoid IP fragmentation. A single lost IP fragment drops the entire datagram.
- In high-frequency sensor pipelines, use `try_send` rather than `send().await` when forwarding to a downstream channel. Blocking the receive loop on a full channel is worse than dropping one datagram.
Lesson 3 — HTTP Clients with reqwest: Async REST Calls to Meridian's Mission API
Module: Foundation — M04: Network Programming
Position: Lesson 3 of 3
Source: Synthesized from reqwest documentation and training knowledge
Source note: This lesson synthesizes from the `reqwest` 0.12.x API documentation and training knowledge. Verify connection pool configuration options against the current `reqwest::ClientBuilder` docs if behaviour differs.
Context
The Meridian control plane is not an island. It fetches TLE updates from the external Space-Track catalog API, posts conjunction alerts to the mission operations REST endpoint, and retrieves ground station configuration from an internal config service. All of these are HTTP calls — outbound, async, with retry logic and timeouts.
reqwest is the standard async HTTP client for Rust. It wraps hyper (the underlying HTTP implementation) with a high-level, ergonomic API, built-in connection pooling, JSON support through serde, and configurable timeout and retry behaviour. Understanding how to use it correctly — particularly how Client is shared, how connection pools work, and how to handle failures robustly — is essential for any Rust service that communicates with external APIs.
Core Concepts
Client — Shared, Pooled, Long-Lived
reqwest::Client manages a connection pool internally. Building a Client is expensive — it allocates the pool, loads TLS configuration, and initializes DNS resolution. A Client is designed to be created once and cloned cheaply for sharing across tasks.
```rust
use reqwest::Client;
use std::time::Duration;

fn build_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // Overall request timeout: connection + headers + body.
        .timeout(Duration::from_secs(30))
        // How long to wait for the TCP connection to establish.
        .connect_timeout(Duration::from_secs(5))
        // Keep connections alive for reuse — avoids a TCP handshake per request.
        .pool_idle_timeout(Duration::from_secs(90))
        .pool_max_idle_per_host(10)
        // User-Agent header for all requests.
        .user_agent("meridian-control-plane/1.0")
        .build()?)
}
```
Client is Clone — cloning it is a reference count increment that shares the same underlying connection pool. Pass a Client to tasks by cloning, not by wrapping in Arc<Mutex<Client>>. The Arc is already inside Client.
Never create a new Client per request. Each new Client is a new connection pool — you lose all the benefit of connection reuse and accumulate resource overhead proportional to your request rate.
Making Requests
The basic request pattern: call a method on the Client to get a RequestBuilder, add headers and body, call .send().await, check the status, and deserialize the response:
```rust
use reqwest::Client;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct TleRecord {
    norad_id: u32,
    name: String,
    line1: String,
    line2: String,
}

async fn fetch_tle(client: &Client, norad_id: u32) -> anyhow::Result<TleRecord> {
    let url = format!("https://api.meridian.internal/tle/{norad_id}");
    let response = client
        .get(&url)
        .header("X-API-Key", "mission-control-key")
        .send()
        .await?;
    // error_for_status() converts 4xx/5xx responses into Err.
    // Without this, a 404 or 500 is not an error — you receive the body.
    let response = response.error_for_status()?;
    let record: TleRecord = response.json().await?;
    Ok(record)
}
```
error_for_status() is important. A 404 or 503 does not cause .send().await to return Err — only network errors do. If you omit error_for_status(), a 500 response body is deserialized as if it were a valid TleRecord, producing a confusing JSON parse error rather than a clear HTTP error.
Sending JSON Bodies
For POST and PUT requests with JSON bodies, use .json(&value) on the RequestBuilder. It serializes the value with serde, sets the Content-Type: application/json header, and sets the body:
```rust
use reqwest::Client;
use serde::Serialize;

#[derive(Serialize)]
struct ConjunctionAlert {
    object_a: u32,
    object_b: u32,
    tca_seconds: f64,
    miss_distance_km: f64,
}

async fn post_alert(client: &Client, alert: &ConjunctionAlert) -> anyhow::Result<()> {
    client
        .post("https://api.meridian.internal/alerts")
        .json(alert)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```
.json() requires the json feature on reqwest (enabled by default). For large payloads that should be streamed rather than buffered in memory, use .body(reqwest::Body::wrap_stream(stream)) instead.
Retry Logic with Exponential Backoff
External APIs fail transiently — rate limits, brief outages, transient DNS failures. A single retry with a fixed delay is rarely sufficient. Exponential backoff with jitter is the standard approach: wait 1s, then 2s, then 4s, with random jitter to avoid thundering herds:
```rust
use reqwest::{Client, StatusCode};
use tokio::time::{sleep, Duration};

async fn fetch_with_retry(
    client: &Client,
    url: &str,
    max_attempts: u32,
) -> anyhow::Result<String> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        let result = client.get(url).send().await;
        match result {
            Ok(resp) if resp.status().is_success() => {
                return Ok(resp.text().await?);
            }
            Ok(resp) if resp.status() == StatusCode::TOO_MANY_REQUESTS => {
                // Respect the Retry-After header if present, otherwise back off.
                let retry_after = resp
                    .headers()
                    .get("Retry-After")
                    .and_then(|v| v.to_str().ok())
                    .and_then(|s| s.parse::<u64>().ok())
                    .unwrap_or(0);
                let delay = if retry_after > 0 {
                    Duration::from_secs(retry_after)
                } else {
                    backoff_delay(attempt)
                };
                tracing::warn!(attempt, url, ?delay, "rate limited — backing off");
                if attempt >= max_attempts {
                    anyhow::bail!("rate limit exhausted");
                }
                sleep(delay).await;
            }
            Ok(resp) if resp.status().is_server_error() => {
                tracing::warn!(attempt, url, status = %resp.status(), "server error");
                if attempt >= max_attempts {
                    anyhow::bail!("server error after {max_attempts} attempts");
                }
                sleep(backoff_delay(attempt)).await;
            }
            Ok(resp) => {
                // 4xx client errors (except 429) are not retryable.
                anyhow::bail!("request failed: HTTP {}", resp.status());
            }
            Err(e) if e.is_connect() || e.is_timeout() => {
                tracing::warn!(attempt, url, "network error: {e}");
                if attempt >= max_attempts {
                    return Err(e.into());
                }
                sleep(backoff_delay(attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

fn backoff_delay(attempt: u32) -> Duration {
    // Exponential backoff: 1s, 2s, 4s, 8s, ... capped at 30s.
    // Add jitter to avoid a thundering herd.
    use std::time::SystemTime;
    let base = Duration::from_secs((1u64 << attempt.saturating_sub(1).min(5)).min(30));
    let jitter_ms = SystemTime::now()
        .duration_since(SystemTime::UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_millis();
    base + Duration::from_millis(jitter_ms as u64)
}
```
Retry strategy by status code:
- 5xx (server error): retry with backoff — transient server issues.
- 429 (too many requests): retry with backoff; respect the `Retry-After` header.
- 408 (request timeout) or connection/timeout errors: retry with backoff.
- 4xx (client errors) except 429: do not retry — the request itself is malformed.
- Success: return immediately.
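The classification above can be captured in a small predicate. This sketch works on the raw `u16` status code rather than `reqwest::StatusCode`, but the logic is the same:

```rust
/// Classify an HTTP status code as retryable or not, following the
/// strategy described above. Success (2xx) is handled before retry
/// logic ever runs, so it is simply "not retryable" here.
fn is_retryable(status: u16) -> bool {
    match status {
        429 | 408 => true,  // rate limited / request timeout
        500..=599 => true,  // transient server errors
        _ => false,         // other 4xx: the request itself is malformed
    }
}

fn main() {
    assert!(is_retryable(503));
    assert!(is_retryable(429));
    assert!(!is_retryable(404));
    assert!(!is_retryable(200));
    println!("classification ok");
}
```

Keeping the classification in one pure function makes the retry loop easier to test than scattering status checks through match arms.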
Configuring Timeouts Correctly
A single .timeout(Duration) sets the overall request timeout (connection + sending + receiving). For fine-grained control:
```rust
use reqwest::Client;
use std::time::Duration;

fn build_production_client() -> anyhow::Result<Client> {
    Ok(Client::builder()
        // TCP connection timeout — fail fast if the service is unreachable.
        .connect_timeout(Duration::from_secs(3))
        // Total time budget for the entire request (all phases).
        .timeout(Duration::from_secs(15))
        // How long an idle connection can sit in the pool before being closed.
        .pool_idle_timeout(Duration::from_secs(60))
        .build()?)
}
```
For the Meridian TLE catalog API — a slow external service that can take up to 10 seconds to respond during load — set the timeout to 12–15 seconds. For the internal mission ops REST endpoint on the same datacenter network, 3–5 seconds is appropriate. Do not use the same Client configuration for both if the timeout requirements differ significantly — build two clients.
Code Examples
TLE Catalog HTTP Client for the Control Plane
The control plane fetches TLE updates from Space-Track on a 10-minute schedule. It also exposes a REST endpoint for on-demand TLE queries. This example shows both directions: fetching and posting, with retry logic and a shared client.
```rust
use anyhow::{Context, Result};
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::time::sleep;

#[derive(Debug, Deserialize, Clone)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
    pub epoch: String,
}

#[derive(Debug, Serialize)]
pub struct ConjunctionReport {
    pub object_a_id: u32,
    pub object_b_id: u32,
    pub tca_unix: f64,
    pub miss_distance_km: f64,
    pub probability: f64,
}

pub struct MissionApiClient {
    client: Client,
    base_url: String,
    api_key: String,
}

impl MissionApiClient {
    pub fn new(base_url: String, api_key: String) -> Result<Self> {
        let client = Client::builder()
            .connect_timeout(Duration::from_secs(5))
            .timeout(Duration::from_secs(20))
            .pool_max_idle_per_host(4)
            .user_agent("meridian-control-plane/1.0")
            .build()
            .context("failed to build HTTP client")?;
        Ok(Self { client, base_url, api_key })
    }

    /// Fetch a single TLE record with up to 3 retry attempts.
    pub async fn get_tle(&self, norad_id: u32) -> Result<TleRecord> {
        let url = format!("{}/tle/{norad_id}", self.base_url);
        let mut attempt = 0u32;
        loop {
            attempt += 1;
            let response = self.client
                .get(&url)
                .header("X-API-Key", &self.api_key)
                .send()
                .await;
            match response {
                Ok(resp) if resp.status().is_success() => {
                    return resp.json::<TleRecord>().await
                        .context("failed to parse TLE response");
                }
                Ok(resp) if resp.status().is_server_error() && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, status = %resp.status(), "retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Ok(resp) => {
                    anyhow::bail!("TLE fetch failed: HTTP {}", resp.status());
                }
                Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                    tracing::warn!(norad_id, attempt, "network error: {e}, retrying");
                    sleep(Duration::from_secs(1 << attempt)).await;
                }
                Err(e) => return Err(e).context("TLE fetch network error"),
            }
        }
    }

    /// Post a conjunction report to the mission operations endpoint.
    pub async fn post_conjunction(&self, report: &ConjunctionReport) -> Result<()> {
        self.client
            .post(format!("{}/conjunctions", self.base_url))
            .header("X-API-Key", &self.api_key)
            .json(report)
            .send()
            .await
            .context("failed to send conjunction report")?
            .error_for_status()
            .context("conjunction report rejected")?;
        Ok(())
    }

    /// Fetch all active TLEs in a specified altitude band (batch request).
    pub async fn get_tle_batch(&self, min_km: u32, max_km: u32) -> Result<Vec<TleRecord>> {
        self.client
            .get(format!("{}/tle/batch", self.base_url))
            .query(&[("min_alt_km", min_km), ("max_alt_km", max_km)])
            .header("X-API-Key", &self.api_key)
            .send()
            .await?
            .error_for_status()?
            .json::<Vec<TleRecord>>()
            .await
            .context("failed to parse TLE batch response")
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();
    let api = MissionApiClient::new(
        "https://api.meridian.internal".to_string(),
        "mission-control-key".to_string(),
    )?;

    // Periodic TLE refresh loop.
    let api_ref = std::sync::Arc::new(api);
    let refresh_api = std::sync::Arc::clone(&api_ref);
    tokio::spawn(async move {
        loop {
            match refresh_api.get_tle(25544).await {
                Ok(tle) => tracing::info!(name = %tle.name, "TLE refreshed"),
                Err(e) => tracing::error!("TLE refresh failed: {e}"),
            }
            sleep(Duration::from_secs(600)).await;
        }
    });

    // Post a conjunction report.
    api_ref.post_conjunction(&ConjunctionReport {
        object_a_id: 25544,
        object_b_id: 48274,
        tca_unix: 1_735_000_000.0,
        miss_distance_km: 0.8,
        probability: 0.003,
    }).await?;

    sleep(Duration::from_secs(1)).await;
    Ok(())
}
```
The MissionApiClient wraps the reqwest::Client and encodes the API contract — base URL, auth header, response types — in one place. Callers interact with typed methods rather than raw HTTP primitives. The Arc::new(api) pattern is appropriate here because Client is already internally reference-counted; wrapping in Arc just lets the MissionApiClient struct itself be shared. A simpler option is to pass &MissionApiClient to async functions directly, since MissionApiClient is Send + Sync.
Key Takeaways
- Create one `Client` per configuration profile and share it across tasks via `Clone`. Each new `Client` is a new connection pool — creating one per request wastes connection setup overhead and defeats pooling.
- Always call `error_for_status()` after `.send().await` unless you explicitly want to handle 4xx/5xx response bodies. HTTP error responses do not return `Err` from `send()`.
- Use `.json(&value)` for serializing request bodies with serde. Use `.json::<T>()` on the response for deserialization. Both require the `json` feature (enabled by default).
- Distinguish retryable errors (5xx, 429, connection/timeout errors) from non-retryable ones (4xx client errors). Apply exponential backoff with jitter for retryable failures. Respect `Retry-After` headers on 429 responses.
- Set `connect_timeout` separately from the overall `.timeout`. A short connect timeout (3–5s) fails fast on unreachable services without waiting for the full request timeout budget.
- For different external services with different latency profiles and rate limits, use separate `Client` instances with separate configurations rather than sharing one client across everything.
Project — Ground Station Network Client
Module: Foundation — M04: Network Programming
Prerequisite: All three module quizzes passed (≥70%)
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0051 — Ground Station Network Client Implementation
The Meridian control plane currently uses a Python subprocess to manage ground station TCP connections. It provides no reconnection logic, no session health monitoring, and no integration with the TLE catalog API for per-session orbital data refresh. Under antenna tracking interruptions, sessions drop and are never re-established. Under Space-Track API rate limiting, TLE data becomes stale without any backoff or retry.
Your task is to build the ground station network client — the component that owns the full lifecycle of a ground station TCP session: connect, read frames, reconnect on failure, refresh TLE data via HTTP, and shut down cleanly.
System Specification
Connection Management
The client connects to a ground station TCP endpoint (host:port). The length-prefix framing protocol from Lesson 1 applies: 4-byte big-endian u32 length header followed by length bytes of payload.
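The framing itself is easy to exercise in isolation. This std-only sketch captures the wire format (`encode_frame` and `decode_frame` are illustrative helper names, separate from the async read/write paths inside the client):

```rust
/// Length-prefix framing from Lesson 1: a 4-byte big-endian u32 length
/// header followed by `length` bytes of payload.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(4 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Try to decode one frame from the front of `buf`; returns the payload
/// and the number of bytes consumed, or None if the frame is incomplete.
fn decode_frame(buf: &[u8]) -> Option<(Vec<u8>, usize)> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes(buf[0..4].try_into().ok()?) as usize;
    if buf.len() < 4 + len {
        return None;
    }
    Some((buf[4..4 + len].to_vec(), 4 + len))
}

fn main() {
    let frame = encode_frame(b"TELEMETRY");
    assert_eq!(&frame[0..4], &9u32.to_be_bytes());
    let (payload, used) = decode_frame(&frame).unwrap();
    assert_eq!(payload, b"TELEMETRY");
    assert_eq!(used, 13);
    println!("round-trip ok");
}
```

Having pure encode/decode functions lets you unit-test the framing without opening a socket; the async client then only has to feed bytes in and out.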
On connection loss (EOF, read error, timeout), the client reconnects automatically with exponential backoff: 1s, 2s, 4s, 8s, up to 30s maximum. If reconnection fails for more than 5 minutes total, the client marks the station as Failed and stops retrying.
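The backoff schedule can be isolated into a pure function, which keeps it testable. A sketch assuming a zero-based attempt counter:

```rust
use std::time::Duration;

/// Reconnect backoff per the spec: 1s, 2s, 4s, 8s, ... capped at 30s.
/// Assumes `attempt` starts at 0 for the first retry.
fn reconnect_delay(attempt: u32) -> Duration {
    Duration::from_secs((1u64 << attempt.min(5)).min(30))
}

fn main() {
    assert_eq!(reconnect_delay(0), Duration::from_secs(1));
    assert_eq!(reconnect_delay(3), Duration::from_secs(8));
    // 2^5 = 32 would exceed the cap, so attempts 5 and beyond settle at 30s.
    assert_eq!(reconnect_delay(5), Duration::from_secs(30));
    assert_eq!(reconnect_delay(40), Duration::from_secs(30));
    println!("schedule ok");
}
```

Note the `min(5)` on the shift: without it, a long outage would eventually shift past 63 bits and overflow, so the clamp is load-bearing, not cosmetic.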
Session Lifecycle
Connecting → Connected → Receiving frames → [disconnect] → Reconnecting → Connected → ...
→ [shutdown signal] → Disconnecting → Stopped
→ [5 min failure] → Failed
The current session state is tracked as an enum and exposed via a watch channel so monitoring systems can observe it.
TLE Refresh
Each active session periodically fetches the TLE record for the session's assigned satellite from the mission API (GET /tle/{norad_id}). The refresh interval is configurable (default: 10 minutes). The HTTP client uses a connect_timeout of 3s and overall timeout of 15s. On 5xx or network errors, the refresh is retried with exponential backoff (up to 3 attempts). On 429, the backoff respects a Retry-After header if present.
Frame Forwarding
Successfully received frames are forwarded to a tokio::sync::mpsc::Sender<Frame>. The frame includes the station ID, the session's current TLE record (if available), and the raw payload. If the downstream channel is full, the frame is dropped and a warning is logged.
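The drop-on-full policy can be demonstrated without a runtime: the standard library's bounded `sync_channel` has `try_send` semantics analogous to `tokio::sync::mpsc` — it fails immediately when the buffer is full instead of blocking the caller:

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // Capacity 2 stands in for the bounded frame channel. try_send
    // returns Err when the buffer is full — the frame is dropped and
    // (in the real client) a warning is logged instead of blocking.
    let (tx, rx) = sync_channel::<&str>(2);
    assert!(tx.try_send("frame-1").is_ok());
    assert!(tx.try_send("frame-2").is_ok());
    // Downstream consumer is stalled — the third frame is dropped.
    assert!(tx.try_send("frame-3").is_err());
    // Once the consumer drains one frame, capacity frees up again.
    assert_eq!(rx.recv().unwrap(), "frame-1");
    assert!(tx.try_send("frame-4").is_ok());
    println!("drop-on-full ok");
}
```

The same shape applies with `tokio::sync::mpsc::Sender::try_send` in the client itself; only the channel type and the logging change.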
Shutdown
A watch::Receiver<bool> shutdown signal is accepted. On signal: complete the current frame read (do not abort mid-frame), flush any buffered writes (send a final GOODBYE frame to the peer), close the TCP connection cleanly, and exit.
Expected Output
A library crate (meridian-gs-client) with:
- A `GroundStationClient` struct with a `run()` method
- A `SessionState` enum and a `watch` channel for state observation
- A `Frame` struct forwarded to the downstream channel
- A test binary that: connects to a local echo server (you implement a minimal echo server in the test), receives 5 frames, triggers reconnect by having the echo server drop the connection, verifies reconnection, then triggers shutdown
The test binary output should clearly show: initial connection, frame receipt, connection drop, reconnection, and clean shutdown.
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | Client reconnects automatically on connection loss with exponential backoff | Yes — drop the server connection and verify reconnection in logs |
| 2 | Reconnection backoff is bounded at 30 seconds | Yes — check timing between reconnect attempts under sustained failure |
| 3 | Client marks station as Failed after 5 minutes of failed reconnections | Yes — simulate sustained connection refusal and verify state transition |
| 4 | TLE refresh runs on the configured interval and retries on 5xx/network errors | Yes — mock server returning 503 then 200 |
| 5 | Frame forwarding uses try_send — channel-full does not block the receive loop | Yes — code review and test with a slow downstream consumer |
| 6 | Shutdown completes the current frame before exiting | Yes — send a large frame and trigger shutdown mid-send; frame arrives complete |
| 7 | Session state transitions are correctly published to the watch channel | Yes — observer task sees all transitions in order |
Hints
Hint 1 — Session state machine
```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected { since: std::time::Instant },
    Reconnecting { attempt: u32, next_retry: std::time::Instant },
    Disconnecting,
    Failed { reason: String },
    Stopped,
}
```
Publish state changes via watch::Sender<SessionState>. Observers call borrow() to read the current state or changed().await to wait for the next transition.
Hint 2 — Reconnect loop structure
```rust
async fn run_with_reconnect(
    config: &ClientConfig,
    tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let mut attempt = 0u32;
    let start = std::time::Instant::now();
    loop {
        if *shutdown.borrow() {
            break;
        }
        if start.elapsed() > std::time::Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "reconnection window exceeded".into(),
            });
            break;
        }
        let _ = state_tx.send(SessionState::Connecting { attempt });
        match tokio::net::TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0; // Reset on successful connection.
                let _ = state_tx.send(SessionState::Connected {
                    since: std::time::Instant::now(),
                });
                // Run the session until it disconnects or shutdown.
                run_session(stream, config, &tx, &mut shutdown, &state_tx).await;
                if *shutdown.borrow() {
                    break;
                }
            }
            Err(e) => {
                tracing::warn!("connection failed (attempt {attempt}): {e}");
            }
        }
        attempt += 1;
        let delay = std::time::Duration::from_secs((1u64 << attempt.min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting {
            attempt,
            next_retry: std::time::Instant::now() + delay,
        });
        tokio::time::sleep(delay).await;
    }
}
```
Hint 3 — TLE refresh as a background task per session
Spawn a TLE refresh task when the session connects. Abort it when the session disconnects. Use a watch::Sender<Option<TleRecord>> to share the current TLE with the frame handler:
```rust
async fn run_tle_refresh(
    http: reqwest::Client,
    norad_id: u32,
    interval: std::time::Duration,
    tle_tx: tokio::sync::watch::Sender<Option<TleRecord>>,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            _ = tokio::time::sleep(interval) => {
                match fetch_tle_with_retry(&http, norad_id, 3).await {
                    Ok(tle) => { let _ = tle_tx.send(Some(tle)); }
                    Err(e) => tracing::warn!(norad_id, "TLE refresh failed: {e}"),
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
        }
    }
}
```
Hint 4 — Sending a GOODBYE frame on clean shutdown
```rust
use tokio::io::AsyncWriteExt;

async fn send_goodbye(stream: &mut tokio::net::TcpStream) {
    const GOODBYE: &[u8] = b"GOODBYE";
    let len = (GOODBYE.len() as u32).to_be_bytes();
    // Best-effort — ignore errors (we're shutting down anyway).
    let _ = stream.write_all(&len).await;
    let _ = stream.write_all(GOODBYE).await;
    let _ = stream.flush().await;
    let _ = stream.shutdown().await;
}
```
Reference Implementation
Reveal reference implementation
```rust
use anyhow::Result;
use reqwest::Client as HttpClient;
use serde::Deserialize;
use std::time::{Duration, Instant};
use tokio::{
    io::{AsyncReadExt, AsyncWriteExt},
    net::TcpStream,
    sync::{mpsc, watch},
    time::sleep,
};
use tracing::{info, warn};

#[derive(Debug, Clone, Deserialize)]
pub struct TleRecord {
    pub norad_id: u32,
    pub name: String,
    pub line1: String,
    pub line2: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SessionState {
    Connecting { attempt: u32 },
    Connected,
    Reconnecting { attempt: u32 },
    Failed { reason: String },
    Stopped,
}

#[derive(Debug)]
pub struct Frame {
    pub station_id: String,
    pub tle: Option<TleRecord>,
    pub payload: Vec<u8>,
}

pub struct ClientConfig {
    pub station_id: String,
    pub addr: String,
    pub norad_id: u32,
    pub api_base_url: String,
    pub tle_refresh_interval: Duration,
}

async fn read_frame(stream: &mut TcpStream) -> Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match stream.read_exact(&mut len_buf).await {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e.into()),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > 65_536 {
        anyhow::bail!("frame too large: {len}");
    }
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf).await?;
    Ok(Some(buf))
}

async fn fetch_tle(http: &HttpClient, base_url: &str, norad_id: u32) -> Result<TleRecord> {
    let url = format!("{base_url}/tle/{norad_id}");
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match http.get(&url).send().await {
            Ok(r) if r.status().is_success() => {
                return Ok(r.json::<TleRecord>().await?);
            }
            Ok(r) if r.status().is_server_error() && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch server error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Ok(r) => anyhow::bail!("TLE fetch: HTTP {}", r.status()),
            Err(e) if (e.is_connect() || e.is_timeout()) && attempt < 3 => {
                warn!(norad_id, attempt, "TLE fetch network error, retrying");
                sleep(Duration::from_secs(1 << attempt)).await;
            }
            Err(e) => return Err(e.into()),
        }
    }
}

async fn run_session(
    mut stream: TcpStream,
    config: &ClientConfig,
    http: &HttpClient,
    frame_tx: &mpsc::Sender<Frame>,
    tle_tx: &watch::Sender<Option<TleRecord>>,
    mut shutdown: watch::Receiver<bool>,
) {
    // Kick off the TLE refresh task for this session.
    let (session_shutdown_tx, session_shutdown_rx) = watch::channel(false);
    let tle_refresh = {
        let http = http.clone();
        let base = config.api_base_url.clone();
        let norad_id = config.norad_id;
        let interval = config.tle_refresh_interval;
        let tle_tx = tle_tx.clone();
        tokio::spawn(async move {
            let mut sd = session_shutdown_rx;
            loop {
                tokio::select! {
                    _ = sleep(interval) => {
                        match fetch_tle(&http, &base, norad_id).await {
                            Ok(tle) => { let _ = tle_tx.send(Some(tle)); }
                            Err(e) => warn!("TLE refresh failed: {e}"),
                        }
                    }
                    Ok(()) = sd.changed() => {
                        if *sd.borrow() { break; }
                    }
                }
            }
        })
    };
    loop {
        tokio::select! {
            biased;
            frame = tokio::time::timeout(Duration::from_secs(60), read_frame(&mut stream)) => {
                match frame {
                    Err(_) => {
                        warn!(station = %config.station_id, "session timeout");
                        break;
                    }
                    Ok(Ok(Some(payload))) => {
                        let tle = tle_tx.subscribe().borrow().clone();
                        let f = Frame {
                            station_id: config.station_id.clone(),
                            tle,
                            payload,
                        };
                        if frame_tx.try_send(f).is_err() {
                            warn!(station = %config.station_id, "frame dropped: pipeline full");
                        }
                    }
                    Ok(Ok(None)) => {
                        info!(station = %config.station_id, "peer closed connection");
                        break;
                    }
                    Ok(Err(e)) => {
                        warn!(station = %config.station_id, "read error: {e}");
                        break;
                    }
                }
            }
            Ok(()) = shutdown.changed() => {
                if *shutdown.borrow() {
                    info!(station = %config.station_id, "shutdown — sending GOODBYE");
                    let _ = session_shutdown_tx.send(true);
                    let payload = b"GOODBYE";
                    let len = (payload.len() as u32).to_be_bytes();
                    let _ = stream.write_all(&len).await;
                    let _ = stream.write_all(payload).await;
                    let _ = stream.flush().await;
                    let _ = stream.shutdown().await;
                    break;
                }
            }
        }
    }
    let _ = session_shutdown_tx.send(true);
    let _ = tle_refresh.await;
}

pub async fn run_client(
    config: ClientConfig,
    frame_tx: mpsc::Sender<Frame>,
    mut shutdown: watch::Receiver<bool>,
    state_tx: watch::Sender<SessionState>,
) {
    let http = HttpClient::builder()
        .connect_timeout(Duration::from_secs(3))
        .timeout(Duration::from_secs(15))
        .build()
        .expect("failed to build HTTP client");
    let (tle_tx, _) = watch::channel::<Option<TleRecord>>(None);
    let mut attempt = 0u32;
    let start = Instant::now();
    loop {
        if *shutdown.borrow() {
            break;
        }
        if start.elapsed() > Duration::from_secs(300) {
            let _ = state_tx.send(SessionState::Failed {
                reason: "5-minute reconnect window exceeded".into(),
            });
            return;
        }
        let _ = state_tx.send(SessionState::Connecting { attempt });
        match TcpStream::connect(&config.addr).await {
            Ok(stream) => {
                attempt = 0;
                let _ = state_tx.send(SessionState::Connected);
                info!(station = %config.station_id, "connected to {}", config.addr);
                run_session(stream, &config, &http, &frame_tx, &tle_tx, shutdown.clone()).await;
                if *shutdown.borrow() {
                    break;
                }
                info!(station = %config.station_id, "session ended, will reconnect");
            }
            Err(e) => {
                warn!(station = %config.station_id, attempt, "connection failed: {e}");
            }
        }
        attempt += 1;
        let delay = Duration::from_secs((1u64 << attempt.min(5)).min(30));
        let _ = state_tx.send(SessionState::Reconnecting { attempt });
        info!(station = %config.station_id, "reconnecting in {delay:?}");
        sleep(delay).await;
    }
    let _ = state_tx.send(SessionState::Stopped);
    info!(station = %config.station_id, "client stopped");
}
```
Reflection
The ground station client built here is the connection layer that sits between the raw TCP socket and the telemetry aggregator from Module 3. The three lessons of this module come together directly: TcpListener/TcpStream from Lesson 1 drive the framed session protocol, UDP from Lesson 2 could be added for out-of-band sensor feeds from the same station, and reqwest from Lesson 3 powers the TLE refresh background task within the session.
The reconnection loop pattern — state machine published to a watch channel, exponential backoff, failure timeout — is universal. It applies equally to database connections, message broker connections, and any other persistent network resource that needs supervisory recovery behaviour.
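The backoff schedule used by the client above can be distilled into a small pure function. This is a sketch that mirrors the expression in the reconnect loop (`(1u64 << attempt.min(5)).min(30)`); the function name `backoff_secs` is illustrative, not part of the codebase:

```rust
/// Capped exponential backoff, mirroring the reconnect loop:
/// delay = min(2^min(attempt, 5), 30) seconds.
fn backoff_secs(attempt: u32) -> u64 {
    (1u64 << attempt.min(5)).min(30)
}

fn main() {
    // attempt 1 → 2s, attempt 4 → 16s, attempts ≥ 5 → capped at 30s.
    for attempt in 1..=7 {
        println!("attempt {attempt}: {}s", backoff_secs(attempt));
    }
}
```

The shift amount is clamped before the cap so the doubling never overflows; the 30-second ceiling bounds worst-case reconnect latency regardless of how many attempts have failed.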
Module 05 — Data-Oriented Design in Rust
Track: Foundation — Mission Control Platform
Position: Module 5 of 6
Source material: Rust for Rustaceans — Jon Gjengset, Chapters 2, 9
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — High-Throughput Telemetry Packet Processor
- Prerequisites
- What Comes Next
Mission Context
The Meridian telemetry processor runs at 62,000 frames per second. The conjunction avoidance pipeline requires 100,000. The gap is not a missing algorithm or a suboptimal data structure — it is allocator pressure and cache waste, both caused by data layout decisions made when defining types. Each frame allocates a Vec<u8> on the global heap. Each deduplication pass loads 2.4× more data than the deduplication logic uses.
Data-oriented design is a discipline for making data layout decisions that align with CPU hardware realities: cache lines are 64 bytes, cache misses cost 100–300 cycles, and SIMD instructions operate on contiguous uniform-type data. The three techniques in this module — cache-optimal struct layout, SoA field separation, and arena allocation — directly address the two profiling findings above.
What You Will Learn
By the end of this module you will be able to:
- Explain how field alignment and padding inflate struct sizes, use `repr` attributes to control layout, and write `const` assertions to lock in size expectations at compile time
- Identify false sharing between concurrent tasks, apply `repr(align(64))` with padding to isolate per-thread data to separate cache lines, and separate hot fields from cold fields in structs used in high-volume collections
- Explain when SoA layout outperforms AoS (field-subset sequential operations) and when AoS outperforms SoA (per-entity random access), implement an `OrbitalCatalog` using field grouping, and transition from AoS to SoA incrementally without a full rewrite
- Implement a bump/arena allocator for same-lifetime batch allocations, contrast its allocation cost with the global allocator, use thread-local arenas for zero-contention concurrent allocation, and identify when arena allocation is inappropriate (mixed lifetimes, individual deallocation)
Lessons
Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment
Covers alignment and padding mechanics, repr(C) vs repr(Rust) vs repr(packed) vs repr(align(n)), false sharing between concurrent tasks, repr(align(64)) for per-thread counter isolation, and hot/cold field separation. Grounded in Rust for Rustaceans, Chapter 2.
Key question this lesson answers: How does field order affect struct size, what causes false sharing between concurrent tasks, and how do you isolate hot data from cold data?
→ lesson-01-cache-friendly-layouts.md / lesson-01-quiz.toml
Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins
Covers the AoS and SoA layout patterns, the cache utilisation argument for each, the conditions that favour SoA (field-subset sequential scans, SIMD, large N), the conditions that favour AoS (per-entity random access, all-field operations), the hybrid field-grouping pattern, and incremental AoS-to-SoA transition via a companion index.
Key question this lesson answers: When does splitting fields into separate vectors improve performance, and when does it hurt?
→ lesson-02-soa-vs-aos.md / lesson-02-quiz.toml
Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing
Covers the global allocator's cost for high-frequency short-lived allocations, the bump allocator pattern (O(1) alloc, O(1) epoch free), the lifetime constraint, thread-local arenas for zero-contention concurrent allocation, the bumpalo crate interface, and the workloads where arena allocation is inappropriate.
Key question this lesson answers: When is the global allocator the bottleneck, and how does a bump allocator eliminate that overhead for same-lifetime batch objects?
→ lesson-03-arena-allocation.md / lesson-03-quiz.toml
Capstone Project — High-Throughput Telemetry Packet Processor
Rebuild the Meridian telemetry processor core to achieve ≥100,000 frames/sec using all three techniques: a 24-byte FrameHeader with const size assertion, SoA separation of headers from arena-allocated payloads, bump arena for batch payload allocation with O(1) epoch reset, and SoA-based deduplication operating only on the hot header array.
Acceptance is against 7 verifiable criteria including compile-time size assertions, no per-frame heap allocations, correct arena reset, and measured throughput.
→ project-telemetry-processor.md
Prerequisites
Modules 1–4 must be complete. Module 2 (Concurrency Primitives) introduced atomic operations and the false sharing problem — Lesson 1 of this module extends that with the repr(align(64)) solution. Module 5's content stands alone otherwise; it does not build on the networking or message-passing material from Modules 3–4.
What Comes Next
Module 6 — Performance and Profiling gives you the measurement tools to validate the optimisations introduced here: criterion for reliable microbenchmarks, flamegraph and perf for identifying hot paths, and heap profiling for measuring allocator pressure. You will profile the processor built in this module's project and verify the improvement against the baseline.
Lesson 1 — Cache-Friendly Data Layouts: Struct Layout, Padding, and Cache Line Alignment
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 1 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 2
Context
The Meridian telemetry processor receives 100,000 frames per second at peak load across 48 uplinks. Each frame passes through validation, deduplication, and routing — operations that read specific fields from a TelemetryFrame struct on every iteration. At that throughput, the cost of a CPU cache miss — roughly 100–300 clock cycles to fetch from RAM, compared to 4 cycles for an L1 cache hit — is the difference between keeping up and falling behind.
CPU cache performance is not something you can bolt on after profiling shows a bottleneck. It is determined by the decisions you make when you define your data types. How fields are ordered. How large structs are. Whether hot fields and cold fields share a cache line. These decisions are locked in by the struct definition, and changing them later requires touching every callsite that constructs or accesses the type.
This lesson covers the mechanics that determine how Rust lays out your types in memory, the repr attributes that control those mechanics, and how to make decisions that keep hot data in cache.
Source: Rust for Rustaceans, Chapter 2 (Gjengset)
Core Concepts
Alignment and Padding
Every type has an alignment requirement — the CPU needs its address to be a multiple of some power of two. A u8 needs 1-byte alignment. A u32 needs 4-byte alignment. A u64 needs 8-byte alignment.
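These alignment requirements can be inspected directly with `std::mem::align_of`. A minimal check (values shown assume a typical 64-bit target such as x86-64):

```rust
use std::mem::align_of;

fn main() {
    // Primitive alignment requirements on a typical 64-bit target.
    assert_eq!(align_of::<u8>(), 1);
    assert_eq!(align_of::<u16>(), 2);
    assert_eq!(align_of::<u32>(), 4);
    assert_eq!(align_of::<u64>(), 8);
    println!("u64 must live at addresses divisible by {}", align_of::<u64>());
}
```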
When you put fields of different alignments in a struct, the compiler inserts padding bytes between fields to satisfy alignment requirements (Rust for Rustaceans, Ch. 2). Consider this struct with #[repr(C)] (which preserves field order):
```rust
#[repr(C)]
struct BadLayout {
    tiny: bool,   // 1 byte
    // 3 bytes padding — to align `normal` to 4 bytes
    normal: u32,  // 4 bytes
    small: u8,    // 1 byte
    // 7 bytes padding — to align `long` to 8 bytes
    long: u64,    // 8 bytes
    short: u16,   // 2 bytes
    // 6 bytes padding — to make total size a multiple of alignment (8)
}
// Total: 32 bytes. Actual data: 16 bytes. Wasted: 16 bytes — half the struct is padding.
```
With #[repr(Rust)] (the default), the compiler is free to reorder fields — in practice it sorts them by decreasing alignment — eliminating most padding:
```rust
// Default Rust layout — compiler reorders fields for minimal padding.
// Effective order: long (8), normal (4), short (2), tiny (1), small (1)
struct GoodLayout {
    tiny: bool,
    normal: u32,
    small: u8,
    long: u64,
    short: u16,
}
// Total: 16 bytes. Same fields, no wasted padding.
```
The difference at scale: a Vec<BadLayout> of 1 million elements occupies 32 MB. A Vec<GoodLayout> with the same data occupies 16 MB — fitting twice as many elements in the same cache footprint and halving memory traffic for sequential scans.
You can verify sizes at compile time by combining `const` assertions with std::mem::size_of:
```rust
#[repr(C)]
struct BadLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }

struct GoodLayout { tiny: bool, normal: u32, small: u8, long: u64, short: u16 }

fn main() {
    // Confirm the size difference at compile time.
    const _: () = assert!(std::mem::size_of::<BadLayout>() == 32);
    const _: () = assert!(std::mem::size_of::<GoodLayout>() == 16);
    println!("BadLayout: {} bytes", std::mem::size_of::<BadLayout>());
    println!("GoodLayout: {} bytes", std::mem::size_of::<GoodLayout>());
}
```
Use const assertions as compile-time guards on struct sizes for types that appear in high-volume collections. When a future change accidentally adds padding, the assertion fails at compile time rather than silently degrading cache performance.
repr Attributes
repr(Rust) — the default. The compiler may reorder fields for minimal padding and does not guarantee a specific layout. This is optimal for Rust-only code but incompatible with C interop.
repr(C) — fields laid out in declaration order, C-compatible. Required when passing structs across FFI boundaries. At the cost of potentially more padding if fields are not ordered by descending alignment.
repr(packed) — removes all padding. Fields may be misaligned, which can be much slower on x86 (unaligned loads trigger microcode assists) and cause bus errors on architectures that require alignment. Use only when minimizing memory footprint is more important than access speed — for example, serialized wire formats, or extremely memory-constrained environments.
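A quick sketch of the footprint-versus-alignment trade-off. The struct names here are illustrative, not from the Meridian codebase; note that only sizes are inspected, since taking a reference to a misaligned field of a packed struct is not allowed:

```rust
use std::mem::{align_of, size_of};

// Illustrative: the same two fields with and without packing.
#[repr(C)]
struct PaddedHeader {
    flag: u8,   // 1 byte + 3 bytes padding before `value`
    value: u32, // 4 bytes, 4-byte aligned
}

#[repr(C, packed)]
struct PackedHeader {
    flag: u8,   // 1 byte, no padding follows
    value: u32, // 4 bytes, potentially misaligned
}

fn main() {
    assert_eq!(size_of::<PaddedHeader>(), 8);
    assert_eq!(size_of::<PackedHeader>(), 5);  // all padding removed
    assert_eq!(align_of::<PackedHeader>(), 1); // alignment drops to 1
    println!(
        "padded: {} bytes, packed: {} bytes",
        size_of::<PaddedHeader>(),
        size_of::<PackedHeader>()
    );
}
```

The 3 bytes saved per value cost you aligned access: a `PackedHeader` on the wire is compact, but reading `value` in hot loops may require slower unaligned loads.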
repr(align(n)) — forces the struct to have at least n byte alignment. The most common use in systems programming is cache line alignment for concurrent data structures:
```rust
use std::sync::atomic::AtomicU64;

// Each counter occupies a full 64-byte cache line.
// Without this: two counters from different threads share a cache line,
// causing false sharing — each write by one thread invalidates the
// other thread's cache entry even though they touch different data.
#[repr(align(64))]
struct CacheAlignedCounter {
    value: AtomicU64,
    _pad: [u8; 56], // Explicit padding to fill the 64-byte cache line.
}
```
Cache Lines and False Sharing
A CPU cache line is 64 bytes on x86-64. The CPU fetches and evicts cache lines as atomic units — not individual bytes or words. When two logical pieces of data share a cache line, any write to either one invalidates the entire line in every other core's cache.
False sharing occurs when two threads write to different variables that happen to occupy the same cache line (Rust for Rustaceans, Ch. 2). Each write by either thread causes the cache line to bounce between cores — effectively serializing what should be independent writes:
```rust
use std::sync::atomic::{AtomicU64, Ordering};

// BAD: both counters fit in one 16-byte struct, sharing a cache line.
// Thread A's writes to `a` invalidate thread B's cached copy of the line,
// which also contains `b`. Both threads contend on the same cache line.
struct SharedCounters {
    a: AtomicU64,
    b: AtomicU64,
}

// GOOD: each counter on its own cache line.
#[repr(align(64))]
struct IsolatedCounter {
    value: AtomicU64,
}

fn demonstrate_false_sharing() {
    // With SharedCounters: threads A and B writing independently
    // still cause cache coherence traffic because they share a line.
    // With two IsolatedCounter instances: threads write truly independently.
    let counter_a = IsolatedCounter { value: AtomicU64::new(0) };
    let counter_b = IsolatedCounter { value: AtomicU64::new(0) };

    // counter_a and counter_b now occupy separate cache lines.
    // Writes by one thread do not invalidate the other's cache entry.
    counter_a.value.fetch_add(1, Ordering::Relaxed);
    counter_b.value.fetch_add(1, Ordering::Relaxed);
}
```
Hot Field / Cold Field Separation
Not all fields in a struct are accessed with equal frequency. For a TelemetryFrame, the routing fields (satellite_id, sequence) are read on every frame. The full payload is only read when forwarding downstream. Putting hot and cold data in the same struct means every cache miss for a hot field also loads the cold payload into cache — evicting other useful data.
The pattern: split the struct. Keep a hot "header" struct with frequently accessed fields, and access the cold data via an Arc<Vec<u8>> or a separate index:
```rust
use std::sync::Arc;

// Hot: accessed on every frame for routing decisions.
// 24 bytes — fits comfortably in cache alongside many sibling headers.
struct FrameHeader {
    satellite_id: u32, // 4 bytes
    sequence: u64,     // 8 bytes
    timestamp_ms: u64, // 8 bytes
    flags: u8,         // 1 byte
    _pad: [u8; 3],     // 3 bytes padding (explicit, documented)
}

// Cold: accessed only when forwarding to downstream consumers.
// Heap-allocated; not loaded until needed.
struct FrameBody {
    header: FrameHeader,
    payload: Arc<Vec<u8>>, // Heap allocation keeps cold data out of hot path.
}
```
A Vec<FrameHeader> for routing decisions keeps 24-byte hot entries packed. 64 bytes (one cache line) holds 2 full headers plus change — much better than loading 24 + payload.len() bytes per frame just to check a sequence number.
Code Examples
Verifying Layout Decisions at Compile Time
Use constant assertions to lock in size expectations for hot types. This catches accidental regressions — adding a field that introduces padding shows up as a compile error immediately.
```rust
use std::mem::{align_of, size_of};

/// A telemetry frame header optimized for sequential scanning.
/// Fields ordered by alignment (descending) to minimize padding.
#[derive(Debug, Clone, Copy)]
pub struct TelemetryHeader {
    pub timestamp_ms: u64,  // 8 bytes — largest alignment first
    pub sequence: u64,      // 8 bytes
    pub satellite_id: u32,  // 4 bytes
    pub byte_count: u32,    // 4 bytes
    pub flags: u8,          // 1 byte
    pub station_id: u8,     // 1 byte
    pub _reserved: [u8; 2], // 2 bytes explicit pad — documented intent
}

// Lock in the expected size at compile time.
// If a future change causes unexpected padding, this fails to compile.
const _SIZE_CHECK: () = assert!(size_of::<TelemetryHeader>() == 32);
const _ALIGN_CHECK: () = assert!(align_of::<TelemetryHeader>() == 8);

/// A per-uplink session counter, cache-line aligned to prevent false sharing.
/// 48 sessions each updating their own counter never contend on a shared line.
#[repr(align(64))]
pub struct SessionCounter {
    pub frames_received: u64,
    pub bytes_received: u64,
    pub frames_dropped: u64,
    _pad: [u8; 40], // Pad to fill 64-byte cache line: 3×8 + 40 = 64.
}

const _COUNTER_ALIGN: () = assert!(align_of::<SessionCounter>() == 64);
const _COUNTER_SIZE: () = assert!(size_of::<SessionCounter>() == 64);

fn main() {
    println!("TelemetryHeader: {} bytes", size_of::<TelemetryHeader>());
    println!("SessionCounter: {} bytes (cache-line aligned)", size_of::<SessionCounter>());

    // Verify that an array of counters places each on its own cache line.
    let counters: Vec<SessionCounter> = (0..4)
        .map(|_| SessionCounter {
            frames_received: 0,
            bytes_received: 0,
            frames_dropped: 0,
            _pad: [0; 40],
        })
        .collect();

    // Each counter is 64 bytes and 64-byte aligned — no false sharing.
    for (i, c) in counters.iter().enumerate() {
        let addr = c as *const _ as usize;
        println!("counter[{i}] at 0x{addr:x} (aligned: {})", addr % 64 == 0);
    }
}
```
Key Takeaways
- The compiler inserts padding between fields to satisfy alignment requirements. Field order determines how much padding is inserted. Ordering fields by decreasing size (largest alignment first) minimizes padding with default `repr(Rust)`.
- `repr(Rust)` (default) allows the compiler to reorder fields — usually optimal. `repr(C)` preserves field order for FFI compatibility at the potential cost of more padding. `repr(packed)` removes padding but risks misaligned access penalties.
- `repr(align(n))` forces a minimum alignment. Use it to ensure hot atomic counters occupy separate cache lines when accessed from multiple threads concurrently, preventing false sharing.
- False sharing occurs when two threads write to different variables that share a 64-byte cache line. The fix is `repr(align(64))` with explicit padding to fill the cache line.
- Separate hot fields (read on every iteration) from cold fields (read rarely). A struct that bundles both forces the CPU to load cold data into cache on every hot access, evicting more useful data. Use a header struct for hot fields and heap-allocated or indexed access for cold data.
- Use `const` assertions on `size_of` and `align_of` for types in high-volume collections. They turn accidental layout regressions into compile errors rather than silent performance degradation.
Lesson 2 — Struct-of-Arrays vs Array-of-Structs: When Each Wins
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 2 of 3
Source: Synthesized from training knowledge. Concepts would benefit from verification against Mike Acton's CppCon 2014 talk "Data-Oriented Design and C++" and Chandler Carruth's related talks.
Context
The Meridian conjunction screening pass processes 50,000 orbital elements every 10 minutes. Each screening step reads the altitude and inclination of every object in the catalog. It does not read the object name, the launch date, the operator contact, or any other administrative metadata. Those fields exist in the catalog, but the screening loop does not touch them.
With a conventional struct design — one OrbitalObject struct with all fields — the screening loop loads each full struct into cache on every iteration. The fields it actually uses are 16 bytes; the fields it ignores are perhaps 120 bytes. Roughly 12% of every cache line brought in is useful data. The rest is wasted memory bandwidth.
This is the core insight behind struct-of-arrays (SoA) layout: if an operation only accesses a subset of fields, store those fields contiguously rather than interleaved with irrelevant fields. Processing altitudes[0..50000] accesses only altitude data; there is no orbital metadata in the working set, no cache line waste.
This lesson covers when SoA beats AoS, when AoS beats SoA, and how to implement the transition idiomatically in Rust.
Core Concepts
Array-of-Structs (AoS): The Default
The conventional layout: a Vec<T> where T is a struct containing all fields for one entity.
```rust
// Array-of-Structs: all fields for one object are contiguous.
#[derive(Debug, Clone)]
struct OrbitalObject {
    norad_id: u32,        // 4 bytes
    altitude_km: f64,     // 8 bytes — used in conjunction screening
    inclination_deg: f64, // 8 bytes — used in conjunction screening
    raan_deg: f64,        // 8 bytes — used in conjunction screening
    name: [u8; 24],       // 24 bytes — NOT used in conjunction screening
    launch_year: u16,     // 2 bytes — NOT used in conjunction screening
    _pad: [u8; 6],        // 6 bytes padding
}
// One OrbitalObject: 64 bytes. One cache line.
// The screening loop uses 28 bytes (norad_id + 3 doubles).
// 36 bytes of every cache line are irrelevant to screening.

fn screen_conjunctions_aos(objects: &[OrbitalObject], threshold_km: f64) -> Vec<u32> {
    let mut alerts = Vec::new();
    for (i, a) in objects.iter().enumerate() {
        for b in &objects[i + 1..] {
            // Each iteration loads a full 64-byte OrbitalObject into cache.
            // Only altitude_km and inclination_deg are used.
            let delta = (a.altitude_km - b.altitude_km).abs();
            if delta < threshold_km {
                alerts.push(a.norad_id);
            }
        }
    }
    alerts
}
```
For a 50,000-object catalog, AoS loads 50,000 × 64 bytes = 3.2 MB per pass, even though only 28 bytes per object matter. At a 64-byte cache line, only 44% of the bandwidth fetched carries data the screening loop uses; the remaining 56% is wasted on unused fields.
Struct-of-Arrays (SoA): Fields in Separate Vectors
SoA inverts the layout: instead of one Vec<Object>, maintain separate Vec<field_type> for each field. Objects are indexed by position across all vectors.
```rust
// Struct-of-Arrays: each field is a contiguous array.
// Access patterns that touch only a few fields see only those fields in cache.
struct OrbitalCatalog {
    // "Hot" fields — accessed every screening pass.
    norad_ids: Vec<u32>,
    altitudes_km: Vec<f64>,
    inclinations_deg: Vec<f64>,
    raans_deg: Vec<f64>,
    // "Cold" fields — accessed only for display / export.
    names: Vec<[u8; 24]>,
    launch_years: Vec<u16>,
}

impl OrbitalCatalog {
    fn len(&self) -> usize {
        self.norad_ids.len()
    }

    fn push(&mut self, id: u32, alt: f64, inc: f64, raan: f64, name: [u8; 24], launch: u16) {
        self.norad_ids.push(id);
        self.altitudes_km.push(alt);
        self.inclinations_deg.push(inc);
        self.raans_deg.push(raan);
        self.names.push(name);
        self.launch_years.push(launch);
    }
}

fn screen_conjunctions_soa(catalog: &OrbitalCatalog, threshold_km: f64) -> Vec<u32> {
    let alts = &catalog.altitudes_km;
    let ids = &catalog.norad_ids;
    let mut alerts = Vec::new();
    for i in 0..catalog.len() {
        for j in i + 1..catalog.len() {
            // Only altitudes_km is touched here — 8 bytes per element.
            // 8 f64s fit in one cache line.
            // For 50k objects, working set for altitudes_km = 400KB (fits in L2).
            let delta = (alts[i] - alts[j]).abs();
            if delta < threshold_km {
                alerts.push(ids[i]);
            }
        }
    }
    alerts
}
```
The screening loop now accesses only altitudes_km — 50,000 × 8 bytes = 400 KB, which fits in a typical L2 cache (512KB–2MB). The names, launch years, and RAAN values are never loaded. Cache utilisation is near 100%.
When SoA Wins
SoA is most effective when:
- Operations access a small subset of fields. The conjunction screening loop uses 3 of 6 fields. SIMD vectorization of the altitude comparison operates on 4 doubles per instruction with AVX2 (8 with AVX-512).
- Processing is sequential over all objects. Iterating `altitudes_km[0..50000]` is a linear scan — the hardware prefetcher predicts the access pattern and pre-fetches cache lines ahead of the loop.
- Field values have uniform types amenable to SIMD. A `Vec<f64>` can be processed with `f64x4` or `f64x8` SIMD instructions. An AoS loop cannot be auto-vectorized as efficiently because the fields are interleaved.
- Objects are added and removed infrequently. SoA requires synchronized insertion and removal across all vectors. Random insertion in the middle is O(n) for every field vector simultaneously.
When AoS Wins
AoS is more appropriate when:
- Operations access all or most fields of one object at a time. Constructing a display record or serializing one object reads all fields — SoA forces jumping across multiple vectors.
- Access is random by index. Looking up object ID 25544 requires one index lookup across all vectors. AoS keeps all of 25544's data in one cache line — one miss. SoA scatters it across multiple cache lines.
- Objects are frequently inserted, removed, or moved. AoS insertion is a single `push`. SoA insertion is a `push` across all field vectors — more work and more cache lines touched.
- The struct has few fields or all fields are typically accessed together. If the struct is small (≤ 32 bytes) and all fields are used in every operation, SoA provides no benefit and complicates the API.
Hybrid: AoS with Field Grouping
The practical approach is not a binary AoS vs SoA choice — it is grouping fields by access pattern:
```rust
/// Hot group: fields accessed in every pass of the screening loop.
#[derive(Debug, Clone, Copy)]
struct ObjectHot {
    altitude_km: f64,
    inclination_deg: f64,
    raan_deg: f64,
    eccentricity: f64,
}

/// Cold group: fields accessed for display, export, and audit only.
#[derive(Debug, Clone)]
struct ObjectCold {
    norad_id: u32,
    launch_year: u16,
    name: String,
    operator: String,
}

/// The catalog splits hot and cold data into separate vectors.
/// The index is the common key between them.
struct OrbitalCatalog {
    hot: Vec<ObjectHot>,   // Dense; accessed every screening pass.
    cold: Vec<ObjectCold>, // Sparse access; not in screening hot path.
}
```
The screening pass operates only on hot — a Vec<ObjectHot> of 50,000 × 32 bytes = 1.6 MB, fitting in L3 cache. cold is loaded only for display or export, where its access pattern (one object at a time by index) makes AoS natural.
Code Examples
Parallel Altitude Screening with Rayon and SoA
With altitudes stored in a contiguous Vec<f64>, the screening loop is naturally parallelisable — each parallel chunk accesses an independent range of the altitude slice:
```rust
// Note: rayon is not available in the Playground; this demonstrates the
// pattern. In production, add rayon = "1" to Cargo.toml.

/// Simplified O(n) altitude band screening (not the full O(n²) conjunction check).
/// Finds objects in a dangerous altitude band. SoA makes this trivially parallel
/// and cache-friendly: the working set is just Vec<f64>.
fn screen_altitude_band(
    altitudes_km: &[f64],
    norad_ids: &[u32],
    min_km: f64,
    max_km: f64,
) -> Vec<u32> {
    assert_eq!(altitudes_km.len(), norad_ids.len());
    // Sequential: all altitudes fit in one contiguous slice.
    // Hardware prefetcher maximises cache utilisation.
    altitudes_km
        .iter()
        .zip(norad_ids.iter())
        .filter_map(|(&alt, &id)| {
            if alt >= min_km && alt <= max_km { Some(id) } else { None }
        })
        .collect()
}

fn main() {
    // Simulate a 10,000-object catalog.
    let altitudes_km: Vec<f64> = (0..10_000u32)
        .map(|i| 350.0 + (i as f64) * 0.05)
        .collect();
    let norad_ids: Vec<u32> = (0..10_000u32).collect();

    // Screen for objects in the 400–450 km band (high debris density).
    let alerts = screen_altitude_band(&altitudes_km, &norad_ids, 400.0, 450.0);
    println!("{} objects in 400–450km band", alerts.len());

    // Verify the working set is contiguous and predictable:
    let working_set_bytes = altitudes_km.len() * std::mem::size_of::<f64>();
    println!("working set: {} KB", working_set_bytes / 1024);
}
```
Transposing AoS to SoA Incrementally
Transitioning an existing AoS codebase to SoA does not require a full rewrite. Extract the hot fields into a companion SoA structure, index both by the same key:
```rust
// Existing AoS type — not changed, other code still uses it.
#[derive(Debug, Clone)]
struct TelemetryFrame {
    satellite_id: u32,
    sequence: u64,
    timestamp_ms: u64,
    station_id: u8,
    payload: Vec<u8>,
}

// New SoA hot path for bulk sequence-number deduplication.
// Built from the AoS data; kept in sync on insert.
struct FrameSequenceIndex {
    satellite_ids: Vec<u32>,
    sequences: Vec<u64>,
}

impl FrameSequenceIndex {
    fn from_frames(frames: &[TelemetryFrame]) -> Self {
        Self {
            satellite_ids: frames.iter().map(|f| f.satellite_id).collect(),
            sequences: frames.iter().map(|f| f.sequence).collect(),
        }
    }

    /// Find all duplicate (satellite_id, sequence) pairs — O(n) scan,
    /// cache-friendly because both vecs are small and contiguous.
    fn find_duplicates(&self) -> Vec<usize> {
        let mut seen = std::collections::HashSet::new();
        self.satellite_ids
            .iter()
            .zip(self.sequences.iter())
            .enumerate()
            .filter_map(|(i, (&sat, &seq))| {
                if !seen.insert((sat, seq)) { Some(i) } else { None }
            })
            .collect()
    }
}

fn main() {
    let frames: Vec<TelemetryFrame> = (0..5u32)
        .map(|i| TelemetryFrame {
            satellite_id: i % 3,
            sequence: (i / 3) as u64,
            timestamp_ms: 1_700_000_000 + i as u64,
            station_id: 1,
            payload: vec![i as u8; 128],
        })
        .collect();

    let index = FrameSequenceIndex::from_frames(&frames);
    let dups = index.find_duplicates();
    println!("{} duplicate frames found", dups.len());
}
```
The FrameSequenceIndex co-exists with the original Vec<TelemetryFrame>. Hot operations use the index; display and forwarding use the original frames. The transition is incremental — no global refactor required.
Key Takeaways
- AoS is natural when operations access all or most fields of one entity at a time, or when random access by index is common. SoA is natural when operations process all entities but only a few fields — the common case in simulation and batch processing.
- SoA improves cache utilisation for field-subset operations because the working set is smaller: processing `altitudes_km[0..n]` loads only altitude data, not names, metadata, or other cold fields.
- Sequential access of a contiguous `Vec<f64>` is maximally cache-friendly and SIMD-friendly. The hardware prefetcher predicts linear access patterns; SIMD intrinsics or auto-vectorisation require uniformly-typed contiguous data.
- The practical pattern is field grouping: split hot fields (accessed in every loop) from cold fields (accessed occasionally), and store them in separate vecs. This is a hybrid AoS/SoA approach.
- Transitioning from AoS to SoA does not require a full rewrite. Extract hot fields into a companion SoA index, keep both in sync on insert, and route hot-path operations through the index.
Lesson 3 — Arena Allocation: Bump Allocators for High-Throughput Telemetry Processing
Module: Foundation — M05: Data-Oriented Design in Rust
Position: Lesson 3 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 9 (GlobalAlloc). SoA and arena patterns synthesized from training knowledge.
Source note: The `GlobalAlloc` trait and its safety requirements are covered in Rust for Rustaceans, Ch. 9. The bump allocator pattern and its application to telemetry pipelines are synthesized from training knowledge. Recommended further reading: the `bumpalo` crate documentation and Alexis Beingessner's "The Allocator API, Bump Allocation, and You" (Gankra.github.io).
Context
The Meridian telemetry processor allocates a Vec<u8> for every telemetry frame payload. At 100,000 frames per second, that is 100,000 malloc/free calls per second hitting the global allocator. The global allocator (the system allocator by default, or an opt-in replacement such as jemalloc) is designed for general-purpose allocation: it handles arbitrary sizes, arbitrary lifetimes, and thread-safe concurrent access. This generality has a cost: each allocation acquires an internal lock or performs atomic operations, searches for a free block of appropriate size, and updates allocator metadata.
For short-lived objects that all die together — a batch of frames processed in one scheduling quantum, all freed at the end — a bump allocator eliminates all of that overhead. A bump allocator maintains a pointer into a pre-allocated slab of memory. Each allocation is one pointer addition. Deallocation is a no-op — the entire slab is reclaimed at once when the allocation epoch ends. For the right workload, this is 10–100× faster than the global allocator.
This lesson covers how bump allocators work, when they are appropriate, and how to implement and use them in Rust for high-throughput frame processing.
Core Concepts
The Global Allocator and Its Overhead
Every `Box::new(x)`, and every buffer a `Vec` or `String` grows into, calls the global allocator via the `GlobalAlloc` trait (Rust for Rustaceans, Ch. 9). (`Vec::new()` and `String::new()` themselves allocate nothing; the heap call happens on the first push.)
```rust
// The GlobalAlloc trait (simplified from std):
pub unsafe trait GlobalAlloc {
    unsafe fn alloc(&self, layout: std::alloc::Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: std::alloc::Layout);
}
```
The default allocator (jemalloc in most production Rust, or the system allocator) handles:
- Thread safety (internal locks or lock-free data structures)
- Size classes and free lists for different allocation sizes
- Fragmentation management
- Returning memory to the OS when freed
For long-lived, variously-sized allocations with arbitrary lifetimes, this is correct and often fast. For thousands of small, short-lived allocations that all have the same lifetime, it is expensive overkill.
Bump Allocation: The Pattern
A bump allocator owns a contiguous slab of memory. Each allocation is a pointer increment:
[slab start] → [offset] → [slab end]
↑
pointer bumps forward on each allocation
Freeing individual allocations is not supported. The entire slab is reset in one operation when all allocations are no longer needed — the "epoch" ends and the offset pointer resets to zero.
Properties:
- Allocation: O(1), typically one integer addition and a bounds check.
- Deallocation: O(1) total for all allocations from one epoch — reset the offset.
- Thread safety: A single-threaded bump allocator has no synchronisation overhead. A thread-local bump allocator gives each thread its own slab with no contention.
- Fragmentation: None within the epoch. Memory is never reused for a different allocation during the epoch — no fragmentation.
- Limitation: Cannot free individual allocations. All allocations from one bump allocator share the same lifetime.
Using bumpalo for Safe Bump Allocation
The bumpalo crate provides a production-quality bump allocator with a safe Rust interface:
```rust
// bumpalo is not available in the Playground — this shows the API.
// Add to Cargo.toml: bumpalo = { version = "3", features = ["collections"] }
// use bumpalo::Bump;
// use bumpalo::collections::Vec as BumpVec;

// Illustrating the pattern with a manual approach instead:
struct BumpArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Allocate `size` bytes aligned to `align` (must be a power of two).
    /// Returns None if the slab is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        // Align the current offset up.
        let aligned = (self.offset + align - 1) & !(align - 1);
        let end = aligned + size;
        if end > self.slab.len() {
            return None; // Out of space.
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset the arena — all previous allocations are invalidated.
    fn reset(&mut self) {
        self.offset = 0;
    }

    fn used(&self) -> usize {
        self.offset
    }

    fn capacity(&self) -> usize {
        self.slab.len()
    }
}

fn main() {
    let mut arena = BumpArena::new(4096);

    // Allocate space for 10 u64 values.
    let buf = arena.alloc(10 * 8, 8).expect("arena exhausted");
    println!("allocated {} bytes, used {}/{}", buf.len(), arena.used(), arena.capacity());

    // Reset — all allocations invalidated, slab reused.
    arena.reset();
    println!("after reset: used {}", arena.used());
}
```
Thread-Local Arenas for Concurrent Processing
For a multi-threaded pipeline where each worker thread processes its own batch of frames, a thread-local arena eliminates all lock contention:
```rust
use std::cell::RefCell;

const ARENA_CAPACITY: usize = 16 * 1024 * 1024; // 16 MB per thread

thread_local! {
    // Each worker thread has its own private arena.
    // No synchronisation — no atomic operations, no locks.
    static FRAME_ARENA: RefCell<Vec<u8>> = RefCell::new(vec![0u8; ARENA_CAPACITY]);
    static ARENA_OFFSET: RefCell<usize> = const { RefCell::new(0) };
}

fn alloc_frame_buffer(size: usize) -> *mut u8 {
    FRAME_ARENA.with(|arena| {
        ARENA_OFFSET.with(|offset| {
            let mut off = offset.borrow_mut();
            let aligned = (*off + 7) & !7; // 8-byte alignment
            let end = aligned + size;
            let mut arena = arena.borrow_mut();
            if end > arena.len() {
                panic!("thread-local arena exhausted — increase ARENA_CAPACITY or reduce batch size");
            }
            *off = end;
            // Point at the claimed region, not the slab base.
            arena[aligned..end].as_mut_ptr()
        })
    })
}

fn reset_thread_arena() {
    ARENA_OFFSET.with(|offset| *offset.borrow_mut() = 0);
}
```
In practice, use `bumpalo::Bump` in a `thread_local!` instead of building the unsafe version above. `bumpalo` handles alignment, growth, and lifetime correctly with a safe interface.
Epoch-Based Processing: The Right Workload
The bump allocator pattern maps naturally onto batch processing where all objects in a batch share the same lifetime:
```rust
use std::time::Instant;

/// Simulates a frame batch processor using a bump-style pre-allocated pool.
/// Each frame's payload is drawn from the batch buffer.
/// When the batch is complete, the buffer is reset — no individual frees.
struct FrameBatchProcessor {
    /// Pre-allocated buffer for all frame payloads in one batch.
    payload_pool: Vec<u8>,
    pool_offset: usize,
    batch_size: usize,
    frames_in_batch: usize,
}

impl FrameBatchProcessor {
    fn new(batch_size: usize, max_payload_per_frame: usize) -> Self {
        Self {
            payload_pool: vec![0u8; batch_size * max_payload_per_frame],
            pool_offset: 0,
            batch_size,
            frames_in_batch: 0,
        }
    }

    /// Claim space for a frame payload from the pool.
    /// Returns a mutable slice for the caller to fill.
    fn claim_payload_slot(&mut self, size: usize) -> Option<&mut [u8]> {
        if self.frames_in_batch >= self.batch_size {
            return None; // Batch full.
        }
        let end = self.pool_offset + size;
        if end > self.payload_pool.len() {
            return None; // Pool exhausted.
        }
        let slot = &mut self.payload_pool[self.pool_offset..end];
        self.pool_offset = end;
        self.frames_in_batch += 1;
        Some(slot)
    }

    /// Process the current batch and reset for the next one.
    /// All payload slots are implicitly freed — no individual deallocation.
    fn flush_and_reset(&mut self) -> usize {
        let processed = self.frames_in_batch;
        self.pool_offset = 0;
        self.frames_in_batch = 0;
        processed
    }
}

fn main() {
    let mut processor = FrameBatchProcessor::new(1000, 1024);
    let start = Instant::now();

    // Simulate processing 100,000 frames in batches of 1,000.
    let mut total = 0;
    for _batch in 0..100 {
        for _frame in 0..1000 {
            // Claim a 256-byte payload slot — no malloc.
            if let Some(slot) = processor.claim_payload_slot(256) {
                slot[0] = 0xAA; // Simulate writing frame data.
            }
        }
        total += processor.flush_and_reset();
    }

    let elapsed = start.elapsed();
    println!("processed {total} frames in {:?}", elapsed);
    println!("~{:.0} frames/sec", total as f64 / elapsed.as_secs_f64());
}
```
When Not to Use Bump Allocation
Bump allocators are not appropriate when:
- Lifetimes are mixed. If some objects from a batch need to outlive the batch (e.g., forwarding a specific frame to a slow downstream consumer while releasing the rest), a bump allocator cannot express this. The solution is to copy out the long-lived objects to global-allocator memory before resetting.
- Individual deallocation is required. A bump allocator cannot free one allocation while keeping others alive. Use a pool allocator (fixed-size slots with a free list) if individual deallocation of same-sized objects is needed.
- Batches are unpredictably sized. If you cannot bound the total allocation size of a batch, the arena may exhaust. Size the arena conservatively — or use `bumpalo`, which supports growth by chaining multiple slabs.
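For the second case, a pool allocator with a free list can be sketched in safe Rust. `SlotPool` is an illustrative name and the 256-byte slot size is an arbitrary assumption, not a value from this module:

```rust
/// A minimal fixed-size-slot pool. Unlike a bump arena, individual
/// slots can be released while others stay live.
struct SlotPool {
    slots: Vec<[u8; 256]>, // fixed-size slots, allocated once
    free: Vec<usize>,      // free list: indices of available slots
}

impl SlotPool {
    fn new(n: usize) -> Self {
        Self { slots: vec![[0u8; 256]; n], free: (0..n).rev().collect() }
    }

    /// Claim a slot; None when the pool is exhausted.
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Return one slot to the pool: an individual free, O(1).
    fn release(&mut self, idx: usize) {
        self.free.push(idx);
    }

    fn slot_mut(&mut self, idx: usize) -> &mut [u8; 256] {
        &mut self.slots[idx]
    }
}

fn main() {
    let mut pool = SlotPool::new(2);
    let a = pool.acquire().expect("slot");
    let b = pool.acquire().expect("slot");
    pool.slot_mut(a)[0] = 0xAA;
    pool.release(a); // a is freed while b stays live; a bump arena cannot do this
    let c = pool.acquire().expect("recycled slot");
    println!("recycled slot {c}, b still live: {b}");
}
```

The trade-off versus a bump arena: acquire and release are still O(1), but every slot is the same size, so variable-length payloads waste the tail of each slot.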
Code Examples
Comparing Global vs Arena Allocation for Frame Batches
This benchmark illustrates the overhead difference. Without running it on actual hardware, the expected speedup for small short-lived allocations is 5–20× in favour of the arena.
```rust
use std::time::Instant;

const FRAMES: usize = 100_000;
const PAYLOAD_SIZE: usize = 256;

fn bench_global_alloc() -> std::time::Duration {
    let start = Instant::now();
    for _ in 0..FRAMES {
        // Each Vec::with_capacity triggers a malloc; the pushes fill it.
        let mut v = Vec::with_capacity(PAYLOAD_SIZE);
        for i in 0..PAYLOAD_SIZE {
            v.push(i as u8);
        }
        // Drop at end of loop iteration — free() called 100,000 times.
        let _ = v;
    }
    start.elapsed()
}

fn bench_arena_alloc() -> std::time::Duration {
    // Pre-allocate a slab for the entire batch.
    let mut slab = vec![0u8; FRAMES * PAYLOAD_SIZE];
    let start = Instant::now();
    let mut offset = 0;
    for frame_idx in 0..FRAMES {
        let start_byte = offset;
        let end_byte = offset + PAYLOAD_SIZE;
        for (i, byte) in slab[start_byte..end_byte].iter_mut().enumerate() {
            *byte = (frame_idx ^ i) as u8;
        }
        offset = end_byte;
    }
    // All frames "freed" by resetting offset to 0 — one operation.
    offset = 0;
    let _ = offset;
    start.elapsed()
}

fn main() {
    // Warm up to avoid measurement noise from cold caches.
    let _ = bench_global_alloc();
    let _ = bench_arena_alloc();

    let global_time = bench_global_alloc();
    let arena_time = bench_arena_alloc();
    println!("global alloc: {:?}", global_time);
    println!("arena alloc:  {:?}", arena_time);

    let speedup = global_time.as_nanos() as f64 / arena_time.as_nanos() as f64;
    println!("arena speedup: {speedup:.1}×");
}
```
The arena's advantage grows with allocation count. At 100,000 256-byte frames, the arena avoids 100,000 malloc/free round-trips. The global allocator also has to find and merge free blocks over time as the heap fragments; the arena has zero fragmentation overhead.
Key Takeaways
- The global allocator (`malloc`/`free`) is general-purpose: thread-safe, handles arbitrary sizes and lifetimes, manages fragmentation. Its generality has overhead — internal synchronisation, free-list management, metadata updates.
- A bump allocator eliminates this overhead for objects with a shared lifetime. Allocation is one integer addition. Deallocation is resetting one offset — all objects from one epoch freed simultaneously.
- The lifetime constraint is the critical requirement. If any object from a bump-allocated batch must outlive the batch, copy it out to the global allocator before resetting. Do not try to mix lifetimes within one arena.
- Thread-local arenas eliminate all cross-thread contention. Each worker thread gets its own slab; no lock, no atomic operation, no cache line bounce for allocation.
- Use `bumpalo` in production. It handles alignment, growth via chained slabs, and safe lifetimes. Implement your own bump allocator only for educational purposes or in `no_std` environments where crate dependencies are restricted.
- Profile before optimising. The global allocator is fast for typical workloads. Bump allocation is a targeted optimisation for high-frequency, same-lifetime allocation patterns — not a universal replacement for `Vec` or `Box`.
Project — High-Throughput Telemetry Packet Processor
Module: Foundation — M05: Data-Oriented Design in Rust
Prerequisite: All three module quizzes passed (≥70%)
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0055 — Telemetry Packet Processor Performance Target
The current Rust-language telemetry processor runs at 62,000 frames per second under sustained load. The conjunction avoidance pipeline requires 100,000 frames per second to maintain sub-10-second delivery windows during peak orbital density events. The gap is 38%. Profiling shows two root causes:
- Allocator pressure. The processor allocates a `Vec<u8>` per frame payload on the global heap. At 100k fps, this is 100k `malloc`/`free` round-trips per second — 18% of CPU time.
- Cache waste. The `TelemetryFrame` struct packs hot routing fields with cold payload data. Sequential scan of 100k frames for deduplication loads 2.4× more data than the deduplication logic uses.
Your task is to rebuild the processor core using the three techniques from this module: cache-optimal struct layout, SoA separation of hot and cold data, and arena allocation for frame payloads.
System Specification
Frame Structure
```rust
/// Hot fields — accessed in every pass (routing, deduplication, sorting).
/// Must fit in ≤ 32 bytes and be ordered by descending alignment.
#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence: u64,
    pub satellite_id: u32,
    pub byte_count: u16,
    pub station_id: u8,
    pub flags: u8,
}

/// Cold data — accessed only when forwarding to downstream consumers.
/// Held as a reference into the batch arena; lifetime is one processing epoch.
pub struct FramePayload<'arena> {
    pub data: &'arena [u8],
}
```
Processing Pipeline
The processor receives frames in batches of up to 10,000. For each batch:
- Claim payload space from the batch arena for each frame.
- Validate each frame's CRC (simulated: check that `flags & 0x80 == 0`).
- Deduplicate by `(satellite_id, sequence)` — discard duplicates using a SoA scan over hot headers.
- Sort the batch by `timestamp_ms` ascending — sort only the header array, not the payloads.
- Forward unique sorted frames to a `tokio::sync::mpsc::Sender<ForwardedFrame>`.
- Reset the arena — all payload allocations freed simultaneously.
Performance Target
- Process 100,000 frames per second sustained across a benchmark of 10,000 batches × 1,000 frames.
- Arena allocation must be used for frame payloads — no `Vec<u8>` per payload.
- Hot field access (deduplication and sort) must operate on the header array, not the full frame struct.
- Struct size assertions must compile: `size_of::<FrameHeader>() == 24`.
Output
A binary crate that:
- Generates synthetic frame batches
- Runs the full pipeline (validate → deduplicate → sort → forward) for 10,000 batches
- Reports frames per second, percentage of duplicates discarded, and arena reset count
- Confirms no per-frame heap allocations occur in the hot path (verified by measuring allocator calls)
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | size_of::<FrameHeader>() == 24 — const assertion in source | Yes — compile-time |
| 2 | Frame payloads allocated from batch arena, not global heap | Yes — code review: no Vec::new() or Box::new() in hot path |
| 3 | Deduplication operates on &[FrameHeader] — no full struct access | Yes — code review |
| 4 | Sort operates on the header array by index — payloads not moved | Yes — code review |
| 5 | Arena resets after each batch — used bytes reset to 0 | Yes — assert in batch loop |
| 6 | Benchmark reports ≥ 100,000 frames/sec on a modern laptop | Yes — timing output |
| 7 | Duplicate detection uses a HashSet<(u32, u64)> on header fields only | Yes — code review |
Hints
Hint 1 — FrameHeader size assertion
```rust
const _: () = assert!(
    std::mem::size_of::<FrameHeader>() == 24,
    "FrameHeader must be 24 bytes — check field order and padding"
);
```
Field order for 24 bytes with no padding:
- `timestamp_ms: u64` (8)
- `sequence: u64` (8)
- `satellite_id: u32` (4)
- `byte_count: u16` (2)
- `station_id: u8` (1)
- `flags: u8` (1)

Total: 24 bytes, alignment = 8, no padding.
Hint 2 — Batch arena design
Pre-allocate the slab once per processor lifetime. Reset between batches:
```rust
pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    /// Allocate `size` bytes; returns a mutable slice into the slab.
    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7; // 8-byte alignment
        let end = aligned + size;
        if end > self.slab.len() {
            return None;
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    /// Reset — all previous allocations implicitly freed.
    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}
```
Size the arena for worst-case batch: max_batch_size * max_payload_size.
Hint 3 — SoA deduplication
Maintain a Vec<FrameHeader> (hot, dense) separate from payloads:
```rust
use std::collections::HashSet;

fn deduplicate(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, h)| {
            if seen.insert((h.satellite_id, h.sequence)) {
                Some(i) // Unique — keep this index.
            } else {
                None // Duplicate — discard.
            }
        })
        .collect()
}
```
The deduplication loop touches only satellite_id and sequence from FrameHeader — 12 bytes of the 24-byte struct. With 1,000 headers per batch at 24 bytes each, the working set is 24 KB — fits in L1 cache.
Hint 4 — Sort headers by timestamp without moving payloads
Sort an index array by headers[i].timestamp_ms, not the headers themselves. This avoids any payload movement:
```rust
fn sort_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}
```
Iterating indices in sorted order gives frames in timestamp order without copying or moving any data.
Reference Implementation
Reveal reference implementation
```rust
use std::collections::HashSet;
use std::time::Instant;

// --- FrameHeader ---

#[derive(Debug, Clone, Copy)]
pub struct FrameHeader {
    pub timestamp_ms: u64,
    pub sequence: u64,
    pub satellite_id: u32,
    pub byte_count: u16,
    pub station_id: u8,
    pub flags: u8,
}

const _SIZE: () = assert!(std::mem::size_of::<FrameHeader>() == 24);
const _ALIGN: () = assert!(std::mem::align_of::<FrameHeader>() == 8);

// --- BatchArena ---

pub struct BatchArena {
    slab: Vec<u8>,
    offset: usize,
}

impl BatchArena {
    pub fn new(capacity: usize) -> Self {
        Self { slab: vec![0u8; capacity], offset: 0 }
    }

    pub fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (self.offset + 7) & !7;
        let end = aligned + size;
        if end > self.slab.len() {
            return None;
        }
        self.offset = end;
        Some(&mut self.slab[aligned..end])
    }

    pub fn reset(&mut self) {
        self.offset = 0;
    }

    pub fn used(&self) -> usize {
        self.offset
    }
}

// --- Pipeline ---

fn validate(header: &FrameHeader) -> bool {
    header.flags & 0x80 == 0 // Simulated CRC: high bit = error flag.
}

fn deduplicate_indices(headers: &[FrameHeader]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers
        .iter()
        .enumerate()
        .filter_map(|(i, h)| {
            if seen.insert((h.satellite_id, h.sequence)) { Some(i) } else { None }
        })
        .collect()
}

fn sort_indices_by_timestamp(indices: &mut Vec<usize>, headers: &[FrameHeader]) {
    indices.sort_unstable_by_key(|&i| headers[i].timestamp_ms);
}

fn process_batch(
    arena: &mut BatchArena,
    batch: &[(u64, u64, u32, u16, u8, u8, Vec<u8>)], // (ts, seq, sat, bytes, stn, flags, raw_payload)
) -> (usize, usize) {
    // 1. Fill header array and claim arena slots for payloads.
    let mut headers: Vec<FrameHeader> = Vec::with_capacity(batch.len());
    let mut payload_offsets: Vec<(usize, usize)> = Vec::with_capacity(batch.len()); // (start, len)

    for (ts, seq, sat, bytes, stn, flags, payload) in batch {
        let header = FrameHeader {
            timestamp_ms: *ts,
            sequence: *seq,
            satellite_id: *sat,
            byte_count: *bytes,
            station_id: *stn,
            flags: *flags,
        };
        // Validate before claiming arena space.
        if !validate(&header) {
            continue;
        }
        let slot = match arena.alloc(payload.len()) {
            Some(s) => s,
            None => break, // Arena full — drop remaining frames.
        };
        slot.copy_from_slice(payload);
        let start = arena.used() - payload.len();
        payload_offsets.push((start, payload.len()));
        headers.push(header);
    }

    // 2. Deduplicate on hot header array — SoA benefit.
    let mut unique_indices = deduplicate_indices(&headers);
    let duplicates = headers.len() - unique_indices.len();

    // 3. Sort by timestamp — header array only, no payload movement.
    sort_indices_by_timestamp(&mut unique_indices, &headers);
    let forwarded = unique_indices.len();

    // 4. Reset arena — all payload slots freed in O(1).
    arena.reset();
    (forwarded, duplicates)
}

fn main() {
    const BATCH_SIZE: usize = 1_000;
    const NUM_BATCHES: usize = 10_000;
    const MAX_PAYLOAD: usize = 256;

    let mut arena = BatchArena::new(BATCH_SIZE * MAX_PAYLOAD + 64);

    // Generate synthetic batch data.
    let batch: Vec<(u64, u64, u32, u16, u8, u8, Vec<u8>)> = (0..BATCH_SIZE)
        .map(|i| {
            // Every 3 consecutive frames share (satellite_id, sequence) — ~67% duplicates.
            let seq = (i / 3) as u64;
            (
                1_700_000_000 + i as u64,
                seq,
                ((i / 3) % 48) as u32,
                MAX_PAYLOAD as u16,
                (i % 12) as u8,
                0u8,
                vec![(i & 0xFF) as u8; MAX_PAYLOAD],
            )
        })
        .collect();

    let start = Instant::now();
    let mut total_forwarded = 0usize;
    let mut total_duplicates = 0usize;
    let mut resets = 0usize;

    for _ in 0..NUM_BATCHES {
        let (fwd, dups) = process_batch(&mut arena, &batch);
        total_forwarded += fwd;
        total_duplicates += dups;
        resets += 1;
        assert_eq!(arena.used(), 0, "arena must be reset after each batch");
    }

    let elapsed = start.elapsed();
    let total_frames = BATCH_SIZE * NUM_BATCHES;
    let fps = total_frames as f64 / elapsed.as_secs_f64();

    println!("--- Telemetry Processor Benchmark ---");
    println!("frames:     {}", total_frames);
    println!("forwarded:  {}", total_forwarded);
    println!(
        "duplicates: {} ({:.1}%)",
        total_duplicates,
        100.0 * total_duplicates as f64 / total_frames as f64
    );
    println!("resets:     {}", resets);
    println!("elapsed:    {:.2?}", elapsed);
    println!("throughput: {:.0} frames/sec", fps);
    println!("FrameHeader size: {} bytes", std::mem::size_of::<FrameHeader>());
}
```
Reflection
The three optimisations in this project compose:
- Struct layout ensures the `FrameHeader` array is compact (24 bytes/entry, no wasted padding). 24 KB for 1,000 headers — fits in L1 cache, fully available during the deduplication and sort passes.
- SoA separation means deduplication and sorting never touch payload data — the payload arena is not in the working set during hot-path operations.
- Arena allocation eliminates 100,000 per-second `malloc`/`free` round-trips. All payloads for one batch are freed in a single pointer reset.
Each optimisation is independently valuable. Together, they target the three most common sources of throughput bottlenecks in high-frequency data pipelines: allocator pressure, memory bandwidth waste, and cache thrashing.
The benchmarking mindset from Module 6 (Performance and Profiling) will give you the tools to measure these improvements precisely — comparing before and after with criterion, identifying the limiting factor with perf, and validating that the improvements hold under realistic workload conditions.
Module 06 — Performance & Profiling
Track: Foundation — Mission Control Platform
Position: Module 6 of 6 (Foundation track complete)
Source material: Rust for Rustaceans — Jon Gjengset, Chapter 6; criterion, cargo-flamegraph, perf, dhat documentation
Quiz pass threshold: 70% on all three lessons to unlock the project
- Mission Context
- What You Will Learn
- Lessons
- Capstone Project — Meridian Control Plane Performance Audit
- Prerequisites
- Foundation Track Complete
Mission Context
The Module 5 telemetry processor achieves 100,000 frames per second in isolation. The integrated control plane pipeline runs at 71,000. The 29% gap is not in the algorithm — it is in measurement blind spots: unmeasured allocations, unverified assumptions about what the compiler optimises away, and code paths that look fast but are not.
Performance engineering without measurement is optimism. This module provides the measurement toolkit: criterion for reliable benchmarks, flamegraph and perf for identifying hot paths, and allocation counting for detecting hidden heap overhead. The project combines all three into a structured audit that turns a performance gap into a documented, measured, verified improvement.
What You Will Learn
By the end of this module you will be able to:
- Identify the three failure modes of naive `Instant::now()` benchmarks: dead-code elimination, constant folding, and I/O overhead masking the function under test
- Apply `std::hint::black_box` correctly to both inputs and outputs to prevent compiler optimisations from invalidating benchmark results
- Write `criterion` benchmarks with proper setup/measurement separation, interpret confidence intervals and p-values, and run parameterised benchmarks across input sizes
- Configure the release profile with debug symbols for profiling, generate flamegraphs with `cargo flamegraph`, and identify hot paths from flamegraph visual patterns
- Read `perf stat` output to diagnose whether a workload is compute-bound, memory-bound, or branch-prediction-bound before generating a flamegraph
- Use a `#[global_allocator]` counting wrapper to count allocations in a specific code path, embed zero-allocation assertions in CI, and eliminate common hidden allocation sources (`HashMap::new()`, `Vec::collect()`, `format!()`)
Lessons
Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks
Covers the three failure modes of naive timing loops, std::hint::black_box placement for both input and output, criterion API and confidence interval interpretation, setup/measurement separation, benchmarking at realistic input sizes, and reading the statistical significance output.
Key question this lesson answers: How do you know your benchmark is measuring what you think it is, and how do you distinguish a real performance change from measurement noise?
→ lesson-01-benchmarking.md / lesson-01-quiz.toml
Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths
Covers the sampling profiler model, configuring release builds with debug symbols for profiling, perf stat hardware counter diagnosis (IPC, cache miss rate, branch miss rate), cargo flamegraph workflow, reading flamegraph visual patterns (wide flat bars, deep towers, distributed overhead), and #[inline(never)] for profiling visibility.
Key question this lesson answers: Which function is consuming the most CPU time, and how do you distinguish a compute-bound bottleneck from a memory-bound one?
→ lesson-02-flamegraph.md / lesson-02-quiz.toml
Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure
Covers the allocation cost model, #[global_allocator] counting wrappers for exact per-path allocation counts, HashMap::with_capacity and Vec::with_capacity pre-allocation, clear() for buffer reuse across batches, dhat for call-site-attributed heap profiling, and CI-embedded zero-allocation assertions.
Key question this lesson answers: How many allocations happen in the hot path, which call sites are responsible, and how do you make that a CI assertion rather than a one-time finding?
→ lesson-03-memory-profiling.md / lesson-03-quiz.toml
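The counting-wrapper technique Lesson 3 describes fits in a few lines of std-only Rust. This is a minimal sketch; `CountingAllocator` and `allocations_in` are illustrative names, not the lesson's actual code:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every allocation.
struct CountingAllocator;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Count allocations performed while running `f`.
fn allocations_in(f: impl FnOnce()) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    // A Vec with reserved capacity performs exactly one heap allocation.
    let n = allocations_in(|| {
        let v: Vec<u8> = Vec::with_capacity(256);
        std::hint::black_box(&v);
    });
    println!("allocations in path: {n}");

    // A CI-style zero-allocation assertion for an arithmetic-only path.
    assert_eq!(allocations_in(|| { std::hint::black_box(3u64 + 4); }), 0);
}
```

Relaxed ordering suffices here because the counter is read before and after the closure on the same thread; the atomics exist only because the global allocator must be thread-safe.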
Capstone Project — Meridian Control Plane Performance Audit
Apply the full three-phase audit workflow to the integrated telemetry pipeline: establish a criterion baseline, generate a flamegraph to identify the hot path, use a counting allocator to quantify per-stage allocation overhead, implement the highest-impact fix, and verify the improvement is statistically significant (p < 0.05). Document findings in audit.md.
Acceptance is against 7 verifiable criteria including correct criterion usage, flamegraph generation, per-stage allocation counts, a documented fix, and a p < 0.05 improvement.
→ project-performance-audit.md
Prerequisites
Modules 1–5 must be complete. Module 5 (Data-Oriented Design) established the optimisations being measured here — this module gives you the tools to verify that those optimisations actually work and to prevent future regressions. Module 2 (Concurrency Primitives) introduced atomic operations, which are used by the counting allocator in Lesson 3.
Foundation Track Complete
With Module 6 complete, the Foundation track is done. The six modules cover the complete toolset for building Meridian's control plane in Rust: async task scheduling, concurrency primitives, message-passing architectures, network I/O, data-oriented design, and performance measurement. The four specialisation tracks — Database Internals, Data Pipelines, Data Lakes, and Distributed Systems — are now unlocked and can be taken in any order.
Lesson 1 — Benchmarking with criterion: Writing Reliable Microbenchmarks
Module: Foundation — M06: Performance & Profiling
Position: Lesson 1 of 3
Source: Rust for Rustaceans — Jon Gjengset, Chapter 6
Context
The Module 5 project processor targets 100,000 frames per second. You have a number — but how confident are you in it? The benchmark loop used Instant::now() / elapsed() around a single iteration. That measurement is subject to three failure modes documented in Rust for Rustaceans Ch. 6: performance variance between runs (caused by CPU temperature, OS scheduler interrupts, memory layout), compiler optimisation eliminating the code under test entirely, and I/O overhead masking the actual function cost. A timing loop that contains a println! is usually measuring the speed of terminal output, not your function.
The criterion crate addresses all three. It runs each benchmark hundreds of times, applies statistical analysis to separate signal from noise, detects and reports outliers, and generates comparison reports that tell you whether a change is a real regression or measurement noise. When Meridian's CI pipeline regresses the frame processor throughput by 15%, criterion is how you prove the regression is real, quantify its size, and track it to the specific commit.
Source: Rust for Rustaceans, Chapter 6 (Gjengset)
Core Concepts
Why Instant::now() Loops Are Not Enough
Consider this naive benchmark:
```rust
fn my_function(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        let _ = my_function(&data);
    }
    println!("took {:?}", start.elapsed());
}
```
Two problems. First, the compiler may eliminate `my_function` entirely — the result is bound to `_` and discarded, so nothing in the code requires the computation to happen (Rust for Rustaceans, Ch. 6). In release mode, the loop body may compile to nothing. Second, a single run on a loaded machine is noise, not signal. CPU clock scaling, branch predictor warmup, and OS scheduler preemption all add variance. A function that takes 50µs may measure anywhere from 40µs to 200µs depending on external conditions.
criterion Basics
Add criterion to Cargo.toml:
```toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "frame_processor"
harness = false
```
A criterion benchmark in benches/frame_processor.rs:
```rust
// Note: criterion is a dev-dependency — not available in the Playground.
// This demonstrates the API. In production add criterion = "0.5" to Cargo.toml.
//
// use criterion::{black_box, criterion_group, criterion_main, Criterion};
//
// fn bench_deduplication(c: &mut Criterion) {
//     let headers: Vec<u64> = (0..1000).collect();
//     c.bench_function("dedup_1000", |b| {
//         b.iter(|| {
//             // black_box prevents the compiler from optimising away the input
//             // or treating the result as dead code.
//             black_box(deduplicate(black_box(&headers)))
//         })
//     });
// }
//
// criterion_group!(benches, bench_deduplication);
// criterion_main!(benches);

// Illustrating the structure with std::hint::black_box instead:
fn deduplicate(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn main() {
    let headers: Vec<u64> = (0..1000).collect();

    // Warm up
    for _ in 0..100 {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }

    // Measure
    let iterations = 10_000;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        std::hint::black_box(deduplicate(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?} per iteration", elapsed / iterations);
}
```
black_box: Preventing Dead-Code Elimination
std::hint::black_box (or criterion::black_box) is the key primitive for correct benchmarks. It is an identity function that tells the compiler: assume this value is used in some arbitrary way (Rust for Rustaceans, Ch. 6). This prevents two failure modes:
Eliminating dead computation: if the benchmark discards the result with let _ = expensive(), the compiler may eliminate the call. black_box(expensive()) forces the computation to occur because the compiler must assume black_box uses its argument.
Constant-folding inputs: if the input to a benchmark is a compile-time constant, the compiler may pre-compute the result at compile time. black_box(input) forces the compiler to treat the input as runtime-unknown.
```rust
fn sum_slice(data: &[u32]) -> u64 {
    data.iter().map(|&x| x as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();

    // WRONG: compiler may eliminate the call — result discarded, input known.
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            let _ = sum_slice(&data);
        }
        println!("(likely wrong) took {:?}", start.elapsed());
    }

    // CORRECT: black_box prevents dead-code elimination and constant folding.
    {
        let start = std::time::Instant::now();
        for _ in 0..10_000 {
            std::hint::black_box(sum_slice(std::hint::black_box(&data)));
        }
        println!("(correct) took {:?}", start.elapsed());
    }
}
```
Note the placement: black_box on the input prevents constant folding (the compiler must treat the slice as runtime-unknown). black_box on the output prevents dead-code elimination (the compiler must assume the result is used).
Benchmark Structure Best Practices
Separate setup from measurement. Criterion's b.iter(|| { ... }) closure is the measured unit. Anything outside it is setup and runs once. Constructing test data inside the measured closure inflates the result with allocation cost.
```rust
// Illustrating the pattern with manual timing:
fn build_test_headers(n: usize) -> Vec<u64> {
    // This is setup — not what we want to measure.
    (0..n as u64).collect()
}

fn deduplicate_headers(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn bench_with_setup() {
    // Build test data ONCE — not inside the measured loop.
    let headers = build_test_headers(1000);

    let iterations = 100_000u32;
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // Only the function under test is measured.
        std::hint::black_box(deduplicate_headers(std::hint::black_box(&headers)));
    }
    let elapsed = start.elapsed();
    println!("deduplicate(1000): {:.2?}/iter", elapsed / iterations);
}

fn main() {
    bench_with_setup();
}
```
Benchmark at realistic input sizes. A function that is O(n log n) may be compute-bound at n=100 (the working set fits in cache) and memory-bound at n=100,000 (it spills into L3 and beyond). Benchmark at the sizes you actually use in production. For Meridian's conjunction screen, that is 50,000 objects — not 100.
Use criterion's input size parameter for scaling analysis. BenchmarkGroup lets you benchmark the same function at multiple input sizes and plot throughput vs. size. The slope of that plot tells you whether your function is cache-bound (throughput drops sharply above L2 size) or compute-bound (throughput scales smoothly).
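The group API described above looks roughly like this. A hedged sketch — verify names such as `BenchmarkId::from_parameter` and `Throughput::Elements` against the criterion 0.5 documentation. As elsewhere in this lesson, the criterion code is commented out (it is a dev-dependency, unavailable in the Playground) and a stand-in `main` keeps the file compilable:

```rust
// use criterion::{black_box, BenchmarkId, Criterion, Throughput};
//
// fn bench_dedup_scaling(c: &mut Criterion) {
//     let mut group = c.benchmark_group("dedup_scaling");
//     for n in [100u64, 1_000, 10_000, 100_000] {
//         let headers: Vec<u64> = (0..n).collect();
//         // Throughput::Elements makes criterion report elements/sec,
//         // so the HTML report plots throughput vs. input size.
//         group.throughput(Throughput::Elements(n));
//         group.bench_with_input(BenchmarkId::from_parameter(n), &headers, |b, h| {
//             b.iter(|| black_box(deduplicate(black_box(h))))
//         });
//     }
//     group.finish();
// }

fn deduplicate(headers: &[u64]) -> usize {
    let mut seen = std::collections::HashSet::new();
    headers.iter().filter(|&&h| seen.insert(h)).count()
}

fn main() {
    // Stand-in so the sketch compiles outside a bench harness.
    println!("{}", deduplicate(&[1, 2, 2, 3]));
}
```

A sharp throughput drop between two adjacent sizes on that plot is the cache-boundary signature described above.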
Interpreting Criterion Output
cargo bench produces output like:
```
dedup_1000              time:   [12.453 µs 12.501 µs 12.554 µs]
                        change: [-2.1431% -1.6789% -1.1920%] (p = 0.00 < 0.05)
                        Performance has improved.
```
The three numbers are the lower bound, estimate, and upper bound of the 95% confidence interval. If you see a wide interval (e.g., [5 µs, 50 µs, 200 µs]), measurement variance is high — run on a quieter machine, close background processes, or increase the sample count (e.g., `cargo bench -- --sample-size 200`).
The change line compares against the previous run (stored in target/criterion). A p-value below 0.05 means the change is statistically significant with 95% confidence. Changes with p > 0.05 are likely noise.
Code Examples
Benchmarking the Telemetry Processor with Parameterised Input Sizes
This example uses black_box correctly and varies input size to understand the performance profile across the range of realistic batch sizes.
```rust
use std::collections::HashSet;
use std::hint::black_box;
use std::time::{Duration, Instant};

fn deduplicate(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers.iter().enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

fn sort_by_timestamp(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

/// Run `iterations` iterations (after a short warm-up) and return the
/// mean per-iteration duration.
fn time_fn<F: Fn()>(f: F, iterations: u32) -> Duration {
    // Warm up — let branch predictor and instruction cache settle.
    for _ in 0..10 {
        f();
    }
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed() / iterations
}

fn main() {
    println!("{:<10} {:>15} {:>15} {:>15}", "n_frames", "dedup (µs)", "sort (µs)", "total (µs)");
    println!("{}", "-".repeat(55));

    for &n in &[100usize, 500, 1_000, 5_000, 10_000] {
        // Build test data once — not in the measured loop.
        let headers: Vec<(u32, u64)> = (0..n)
            .map(|i| ((i % 48) as u32, (i / 144) as u64)) // each (sat, seq) appears ~3x
            .collect();
        let timestamps: Vec<u64> = (0..n).map(|i| (n - i) as u64).collect();

        let dedup_time = time_fn(|| {
            black_box(deduplicate(black_box(&headers)));
        }, 10_000);

        let sort_time = time_fn(|| {
            let mut indices = (0..n).collect::<Vec<_>>();
            sort_by_timestamp(black_box(&mut indices), black_box(&timestamps));
            black_box(indices);
        }, 10_000);

        println!("{:<10} {:>15.2} {:>15.2} {:>15.2}",
            n,
            dedup_time.as_secs_f64() * 1e6,
            sort_time.as_secs_f64() * 1e6,
            (dedup_time + sort_time).as_secs_f64() * 1e6,
        );
    }
}
```
The slope of the dedup time as n grows reveals whether the function is O(n) (linear slope on a linear plot) or showing cache effects (steeper slope beyond a threshold). If dedup time grows faster than linearly above n=1000, the HashSet working set has exceeded L1 cache and you are paying for L2/L3 misses.
Key Takeaways
- `Instant::now()` around a single-pass loop is not a reliable benchmark. Performance variance between runs, compiler dead-code elimination, and I/O in the loop can all produce completely wrong numbers (Rust for Rustaceans, Ch. 6).
- `std::hint::black_box` (or `criterion::black_box`) prevents the compiler from eliminating benchmark code as dead. Apply it to both the input (prevent constant folding) and the output (prevent dead-code elimination).
- `criterion` runs each benchmark the statistically appropriate number of times, computes confidence intervals, detects outliers, and reports whether changes between runs are statistically significant. Use `p < 0.05` as the threshold for treating a change as real.
- Separate setup from measurement. Build test data outside the measured closure. Benchmark at realistic production input sizes, not toy sizes. Use parameterised input to understand performance scaling behaviour.
- A 95% confidence interval that is narrow (< 5% spread) indicates a reliable measurement. A wide interval indicates high variance — run on a quieter machine or use `cargo bench -- --sample-size 200` for more samples.
Lesson 2 — CPU Profiling with flamegraph and perf: Finding Hot Paths
Module: Foundation — M06: Performance & Profiling
Position: Lesson 2 of 3
Source: Synthesized from training knowledge (cargo flamegraph, perf, pprof documentation)
Source note: This lesson synthesizes from `cargo-flamegraph`, Linux `perf`, and `pprof` documentation. Verify specific CLI flags against your installed version of `perf` — options vary between kernel versions.
Context
criterion tells you that the frame deduplication function takes 12.5µs. It does not tell you why. Is it the HashSet insertions? The iterator chain? A memory allocation path? To answer that question, you need a CPU profiler — a tool that samples the program's call stack at regular intervals and shows you where time is being spent.
The flamegraph is the standard visualisation for this: a call tree where width encodes time and the call stack grows upward. The widest frames at the top are where your program actually spends its time. A deep narrow tower is a deep but fast call chain. A wide flat bar is a hot leaf function. Reading flamegraphs is a skill that takes a few profiling sessions to develop, but the insight-to-effort ratio is very high.
This lesson covers the two-tool profiling workflow for Rust on Linux: perf to collect samples and cargo flamegraph to generate the visualisation.
Core Concepts
The Profiling Workflow
CPU profiling works by sampling: the OS timer fires at regular intervals (typically 99 Hz or 999 Hz), captures the current call stack, and records the sample. After the program finishes, the accumulated stack samples are folded into a call tree and rendered as a flamegraph. Functions that appear in more samples are proportionally wider in the graph.
The standard workflow:
```bash
# 1. Build with debug info (but optimisations enabled — profile release code).
#    debug = true in [profile.release] preserves symbols without losing optimisations.
#    Add to Cargo.toml:
#      [profile.release]
#      debug = true

# 2. Install cargo-flamegraph (wraps perf or dtrace depending on platform).
cargo install flamegraph

# 3. Profile the binary.
cargo flamegraph --bin meridian-processor -- --frames 100000

# 4. Open the generated flamegraph.svg in a browser.
#    Click any frame to zoom in. Search by function name.
```
On Linux, cargo flamegraph uses perf record under the hood. On macOS, it uses dtrace. The output is always a flamegraph.svg.
Building for Profiling: Debug Symbols in Release Mode
Profiling a debug build measures the wrong thing — debug code contains bounds checks, non-inlined functions, and other overhead that does not exist in production. Profile release builds.
But release builds strip debug symbols by default — the flamegraph shows mangled symbol addresses instead of function names. The fix: add debug info to the release profile without disabling optimisations:
```toml
# Cargo.toml
[profile.release]
debug = true      # Include debug symbols (DWARF info).
opt-level = 3     # Keep full optimisation.

# Note: debug = true increases binary size (~3-10×) but has negligible
# runtime overhead. Strip the binary before deploying to production.
```
Alternatively, use the profiling profile convention:
```toml
[profile.profiling]
inherits = "release"
debug = true
```
Then cargo build --profile profiling && cargo flamegraph --profile profiling.
Reading a Flamegraph
A flamegraph stacks call frames vertically — the root (main) at the bottom, callees above. Width is proportional to the percentage of samples that included that frame in the call stack. The top-most wide frames are the actual hot spots.
Patterns to recognise:
Wide flat bar at the top — a leaf function consuming significant CPU. Investigate whether it can be optimised directly (algorithm, data structure choice) or eliminated (caching, avoiding the call).
Wide bar with many narrow children — a function that spends time distributed across many callees. No single child is dominant; the function itself may be doing overhead work.
Deep narrow tower — a long call chain that is individually fast. Usually indicates overhead from indirection (dynamic dispatch, many small function calls). #[inline] or refactoring may help.
[unknown] frames — samples from code without debug symbols (runtime, system libraries). Usually not actionable. Often reduced by switching to DWARF-based stack unwinding (`perf record --call-graph dwarf`), which can recover stacks through code compiled without frame pointers.
perf stat: Hardware Counter Snapshot
Before generating a flamegraph, perf stat gives a quick diagnostic of what kind of bottleneck you have:
```bash
perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses \
    ./target/release/meridian-processor --frames 100000
```
Output:
```
 Performance counter stats for './target/release/meridian-processor':

     4,521,847,032      cycles
     6,234,891,045      instructions         #   1.38  insn per cycle
        12,847,334      cache-misses         #   8.23% of all cache refs
       156,234,123      cache-references
         2,341,234      branch-misses        #   0.21% of all branches
```
Instructions per cycle (IPC): 1.38 is moderate. Modern CPUs can sustain 3–4 IPC. Low IPC (< 1.5) suggests the processor is stalling — often on memory latency (cache misses) or branch mispredictions.
Cache miss rate: 8.23% is high. Typically < 1% is good. High cache miss rates point to the data layout problems covered in Module 5 — large structs, poor locality, random access patterns.
Branch miss rate: 0.21% is normal. > 5% suggests unpredictable branches — sorting or using branchless comparisons may help.
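The derived percentages in `perf stat` output are simple ratios of the raw counters. Recomputing them is a quick sanity check when comparing two runs — a minimal sketch using the sample values above:

```rust
fn main() {
    // Raw counters copied from the sample perf stat output above.
    let cycles: u64 = 4_521_847_032;
    let instructions: u64 = 6_234_891_045;
    let cache_misses: u64 = 12_847_334;
    let cache_references: u64 = 156_234_123;

    // IPC = retired instructions / elapsed core cycles.
    let ipc = instructions as f64 / cycles as f64;
    // Cache miss rate = misses as a fraction of all cache references.
    let miss_rate = 100.0 * cache_misses as f64 / cache_references as f64;

    println!("IPC:             {ipc:.2}");          // ~1.38, matching perf's annotation
    println!("cache miss rate: {miss_rate:.2}%");   // ~8.2%, matching perf's annotation
}
```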
cargo-flamegraph in Practice
```rust
// Example: a function with a deliberately inefficient hot path
// to demonstrate profiling workflow.
fn find_conjunctions_naive(
    altitudes: &[f64],
    norad_ids: &[u32],
    threshold_km: f64,
) -> Vec<(u32, u32)> {
    let mut alerts = Vec::new();
    let n = altitudes.len();
    for i in 0..n {
        for j in (i + 1)..n {
            // This inner loop is O(n²) — will show as wide in a flamegraph.
            // The call to f64::abs() will likely appear as a hot child.
            if (altitudes[i] - altitudes[j]).abs() < threshold_km {
                alerts.push((norad_ids[i], norad_ids[j]));
            }
        }
    }
    alerts
}

fn main() {
    // Simulate workload for profiling.
    let n = 5_000;
    let altitudes: Vec<f64> = (0..n).map(|i| 400.0 + (i as f64) * 0.1).collect();
    let norad_ids: Vec<u32> = (0..n as u32).collect();
    let alerts = find_conjunctions_naive(&altitudes, &norad_ids, 2.0);
    println!("{} conjunction alerts", alerts.len());
}
```
In a flamegraph of this code, find_conjunctions_naive will be wide at the top (O(n²) iterations), with the subtraction and abs() call visible as the actual hot operations. The outer loop iteration overhead and the Vec::push for matches will also appear.
The flamegraph makes it immediately obvious: the inner loop is the hot path. The fix — using a sort + linear scan instead of O(n²) comparison — is visible from the profile before reading a single line of source.
Annotating Hot Functions with #[inline(never)]
By default, the compiler inlines small functions, which is good for performance but bad for profiling — inlined calls disappear into their callers in the flamegraph. For functions you specifically want to measure in isolation:
```rust
// Prevents inlining — this function will appear as a distinct frame in the flamegraph.
// Remove before production use if inlining is desired for performance.
#[inline(never)]
fn compute_altitude_delta(a: f64, b: f64) -> f64 {
    (a - b).abs()
}

fn main() {
    // In a flamegraph, compute_altitude_delta will appear as its own frame,
    // making it easy to see exactly how much time the subtraction + abs costs.
    let result = compute_altitude_delta(410.0, 408.5);
    println!("{}", result);
}
```
Use #[inline(never)] temporarily during profiling investigations. Remove it afterward — the compiler's inlining decisions are generally correct for production code.
Code Examples
A Profiling-Instrumented Processor Binary
The entry point for profiling runs a realistic workload of sufficient duration for the sampler to collect meaningful data. Too short (< 1 second) and there are too few samples for a reliable flamegraph.
```rust
use std::collections::HashSet;
use std::hint::black_box;
use std::time::Instant;

fn build_test_data(n: usize) -> (Vec<u64>, Vec<(u32, u64)>) {
    let timestamps: Vec<u64> = (0..n as u64).rev().collect();
    let headers: Vec<(u32, u64)> = (0..n)
        .map(|i| ((i % 48) as u32, (i / 144) as u64)) // each (sat, seq) appears ~3x
        .collect();
    (timestamps, headers)
}

#[inline(never)] // Visible as its own frame in flamegraph
fn dedup_pass(headers: &[(u32, u64)]) -> Vec<usize> {
    let mut seen = HashSet::with_capacity(headers.len());
    headers.iter().enumerate()
        .filter_map(|(i, &(sat, seq))| {
            if seen.insert((sat, seq)) { Some(i) } else { None }
        })
        .collect()
}

#[inline(never)] // Visible as its own frame in flamegraph
fn sort_pass(indices: &mut Vec<usize>, timestamps: &[u64]) {
    indices.sort_unstable_by_key(|&i| timestamps[i]);
}

fn process_batch(timestamps: &[u64], headers: &[(u32, u64)]) -> usize {
    let mut indices = dedup_pass(headers);
    sort_pass(&mut indices, timestamps);
    indices.len()
}

fn main() {
    // Run enough iterations for perf to collect ~1000+ samples.
    // At 99 Hz sampling, we need ~10 seconds of CPU time.
    let (timestamps, headers) = build_test_data(10_000);
    let batches = 50_000;

    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..batches {
        total += black_box(process_batch(
            black_box(&timestamps),
            black_box(&headers),
        ));
    }
    let elapsed = start.elapsed();

    println!("processed {} batches, {} unique frames", batches, total);
    println!("throughput: {:.0} batches/sec", batches as f64 / elapsed.as_secs_f64());
    println!("wall time: {:.2?}", elapsed);
}
```
The #[inline(never)] attributes on dedup_pass and sort_pass ensure they appear as distinct frames in the flamegraph. The black_box calls prevent dead-code elimination from interfering with the profiling workload. The loop runs long enough to collect statistically meaningful samples.
Key Takeaways
- Profile release builds with debug symbols (`debug = true` in `[profile.release]`). Profiling debug builds measures overhead that does not exist in production.
- `perf stat` provides a hardware counter snapshot before you generate a flamegraph. High cache miss rate (> 5%) points to data layout issues; low IPC (< 1.5) suggests the processor is stalling on memory; high branch miss rate suggests unpredictable conditionals.
- In a flamegraph, width encodes time. Wide frames at the top are hot leaf functions — the actual bottleneck. Wide frames with narrow children indicate distributed overhead. Deep narrow towers indicate fast call chains, not hot spots.
- `#[inline(never)]` temporarily prevents a function from being inlined so it appears as a distinct frame in the profiler. Remove it after the investigation — inlining is correct for production code.
- A profiling session should last at least 5–10 seconds of CPU time for reliable sample counts at 99 Hz. Use a workload that resembles production access patterns at production input sizes.
Lesson 3 — Memory Profiling: Heap Allocation Tracking and Reducing Allocator Pressure
Module: Foundation — M06: Performance & Profiling
Position: Lesson 3 of 3
Source: Synthesized from training knowledge (dhat, heaptrack, jemalloc statistics, custom allocator wrappers)
Source note: This lesson synthesizes from `dhat` (the Valgrind DHAT profiler and its Rust port), `heaptrack` documentation, and allocator counting patterns. Verify the `dhat-rs` API against the current crate version.
Context
The CPU flamegraph from Lesson 2 shows the telemetry processor spending 18% of time in malloc and free. The criterion benchmark from Lesson 1 confirms: 12.5µs per 1000-frame batch, 2.3µs of which is allocator overhead. The fix from Module 5 — arena allocation — eliminates this. But before implementing it, you need to know: exactly how many allocations happen per batch? Which call sites are responsible? Are there unexpected allocations from library code that you assumed was allocation-free?
Memory profiling answers these questions. Unlike CPU profiling (which samples stochastically), allocation profiling intercepts every alloc and dealloc call — giving you exact counts, sizes, and call sites. The tools: dhat for lightweight in-process counting, heaptrack for comprehensive heap timeline recording, and a custom counting allocator for targeted measurements in CI.
Core Concepts
The Allocation Cost Model
Every Vec::new(), Box::new(), String::from(), and collection growth hits the global allocator. The actual cost depends on the allocator (jemalloc is faster than the system allocator for concurrent workloads), the allocation size (small allocations have higher per-byte overhead), and contention (the global allocator serialises concurrent allocations internally).
Profiling allocation patterns reveals three categories of allocatable objects:
Long-lived allocations — startup config, connection state, per-session data structures. These are unavoidable and not a throughput problem.
Per-batch allocations — temporary buffers, work vectors, accumulators that are created and freed within one processing epoch. These are the target of arena allocation — eliminate them with pre-allocation.
Unexpected allocations — library calls that allocate internally even though the API looks allocation-free. format!(), HashMap::new(), Vec::collect() when the iterator doesn't know its size. These show up in memory profiles and are often surprising.
Counting Allocations: The Simplest Approach
Before reaching for a full memory profiler, a counting allocator wrapper tells you exactly how many allocations occur in a specific code path. This works in any environment and imposes very low overhead:
```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Wraps the system allocator and counts every alloc/dealloc.
struct CountingAllocator;

static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static DEALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

fn snapshot() -> (u64, u64, u64) {
    (
        ALLOC_COUNT.load(Ordering::Relaxed),
        DEALLOC_COUNT.load(Ordering::Relaxed),
        ALLOC_BYTES.load(Ordering::Relaxed),
    )
}

fn reset_counters() {
    ALLOC_COUNT.store(0, Ordering::Relaxed);
    DEALLOC_COUNT.store(0, Ordering::Relaxed);
    ALLOC_BYTES.store(0, Ordering::Relaxed);
}

// --- Application code under test ---

fn process_frames(frames: &[Vec<u8>]) -> usize {
    // Allocates a HashSet internally.
    let mut seen = std::collections::HashSet::new();
    frames.iter().filter(|f| seen.insert(f.as_ptr())).count()
}

fn main() {
    let frames: Vec<Vec<u8>> = (0..100).map(|i| vec![i as u8; 256]).collect();

    // Reset — we only want to count allocations from process_frames.
    reset_counters();
    let result = process_frames(&frames);
    let (allocs, deallocs, bytes) = snapshot();

    println!("process_frames({}) result: {}", frames.len(), result);
    println!("  allocations:   {allocs}");
    println!("  deallocations: {deallocs}");
    println!("  bytes:         {bytes}");
}
```
The output reveals exactly how many times the global allocator was called inside process_frames. If the count is non-zero when it should be zero (the function is supposed to be allocation-free), you have a hidden allocation to hunt down.
Common Hidden Allocation Sources
HashSet::new() and HashMap::new() — these start empty (no allocation) but allocate on the first insert, then grow and rehash as they fill. `HashSet::with_capacity(n)` pre-allocates for n elements up front, eliminating the grow-and-rehash allocations that occur when capacity is exceeded.
Vec::collect() without size hint — if the iterator does not report an exact size via its `size_hint()`, the `Vec` starts with a small capacity and grows (allocating) as elements arrive. When you know the final length, prefer `Vec::with_capacity(n)` plus `extend`, or collect from an `ExactSizeIterator` so the allocation happens once.
format!() and string operations — every format! call allocates a String. In hot paths, prefer writing to a pre-allocated String with write! or push_str, or avoid String entirely in favour of a stack buffer.
Arc::new() is not free — it allocates the reference-counted heap cell. Cloning an existing `Arc` does not allocate (it only increments the count), so pre-create the `Arc` at batch setup time rather than calling `Arc::new()` per frame.
Iterator adapters that buffer — itertools' .sorted() collects into a Vec before yielding; adapters such as .chunks() and .group_by() buffer elements internally. Check whether an adapter is allocation-free before using it in a hot path.
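The `with_capacity` point in the list above can be observed without a profiler by watching `Vec::capacity()` change as a vector grows — every capacity change corresponds to a reallocation plus a copy. A minimal sketch:

```rust
fn main() {
    // A Vec grown without a capacity hint reallocates every time it
    // outgrows its buffer; count how often the capacity changes.
    let mut grown: Vec<u64> = Vec::new();
    let mut reallocs = 0u32;
    let mut last_cap = grown.capacity();
    for i in 0..1_000u64 {
        grown.push(i);
        if grown.capacity() != last_cap {
            reallocs += 1;
            last_cap = grown.capacity();
        }
    }

    // With an up-front capacity, the buffer is allocated once and never moves.
    let mut sized: Vec<u64> = Vec::with_capacity(1_000);
    let cap_before = sized.capacity();
    for i in 0..1_000u64 {
        sized.push(i);
    }
    let sized_reallocs = (sized.capacity() != cap_before) as u32;

    println!("grow-as-you-go capacity changes: {reallocs}");
    println!("with_capacity capacity changes:  {sized_reallocs}"); // always 0
}
```

The exact number of growth steps is an implementation detail of `Vec` (currently roughly doubling), but the `with_capacity` path is guaranteed not to reallocate while the length stays within the requested capacity.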
dhat: In-Process Heap Profiling
dhat (from Valgrind, with a Rust port via the dhat crate) instruments every allocation with a call-site stack trace. It produces a profile that shows, for each allocation site, the total bytes allocated, the peak live bytes, and the number of calls:
```toml
# Cargo.toml
[dependencies]
dhat = { version = "0.3", optional = true }

[features]
dhat-heap = ["dhat"]
```
```rust
// In main.rs — only active when the dhat-heap feature is enabled.
// cfg gate prevents any overhead in production builds.
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ... run workload ...

    println!("dhat profile written on drop of _profiler");
}
```
Run with: cargo run --features dhat-heap. At program exit, dhat writes dhat-heap.json. View it at https://nnethercote.github.io/dh_view/dh_view.html.
The profile shows total bytes allocated per call site — letting you immediately identify which function is responsible for most allocations, even if that function is inside a library you did not write.
Reducing Allocator Pressure: Patterns
Pre-allocate with with_capacity:
```rust
fn process_batch_optimised(n: usize) -> Vec<usize> {
    // Pre-allocate with known capacity — no reallocation on push.
    let mut result = Vec::with_capacity(n);
    let mut seen = std::collections::HashSet::with_capacity(n);
    for i in 0..n {
        if seen.insert(i % (n / 2)) { // ~50% are unique
            result.push(i);
        }
    }
    result
}

fn main() {
    let batch = process_batch_optimised(10_000);
    println!("{} unique items", batch.len());
}
```
Reuse allocations across calls with clear() instead of dropping and reallocating:
```rust
struct FrameProcessor {
    // Persistent buffers — allocated once, reused every batch.
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl FrameProcessor {
    fn new(expected_batch_size: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(expected_batch_size),
            indices: Vec::with_capacity(expected_batch_size),
        }
    }

    fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        // clear() retains the allocation — no new malloc per batch.
        self.seen.clear();
        self.indices.clear();
        for (i, &(sat, seq)) in headers.iter().enumerate() {
            if self.seen.insert((sat, seq)) {
                self.indices.push(i);
            }
        }
        &self.indices
    }
}

fn main() {
    let mut processor = FrameProcessor::new(1000);
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 144) as u64)) // each (sat, seq) appears ~3x
        .collect();

    for batch_num in 0..5 {
        let unique = processor.process(&headers);
        println!("batch {batch_num}: {} unique frames", unique.len());
    }
}
```
The FrameProcessor struct holds the HashSet and Vec across batch calls. Each batch calls clear() — which sets the length to zero but retains the allocated capacity. After the first batch warms up the allocation, subsequent batches make zero allocator calls for these data structures.
Code Examples
Measuring Allocations Per Batch in CI
Embedding an allocation count assertion in CI ensures that future refactors do not accidentally reintroduce per-frame allocations:
```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct CountingAllocator;
static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

// --- Frame processor under test ---

struct Processor {
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl Processor {
    fn new(cap: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }

    fn process_batch(&mut self, headers: &[(u32, u64)]) -> usize {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.indices.push(i);
            }
        }
        self.indices.len()
    }
}

fn main() {
    let headers: Vec<(u32, u64)> = (0..1000)
        .map(|i| ((i % 48) as u32, (i / 144) as u64)) // each (sat, seq) appears ~3x
        .collect();
    let mut processor = Processor::new(1000);

    // Warm up — first batch may allocate as HashSet grows.
    processor.process_batch(&headers);

    // Reset counter — subsequent batches should be allocation-free.
    ALLOC_COUNT.store(0, Relaxed);

    // Run 100 batches.
    for _ in 0..100 {
        std::hint::black_box(processor.process_batch(std::hint::black_box(&headers)));
    }

    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocations across 100 batches: {allocs}");

    // In CI: assert!(allocs == 0, "unexpected allocations in hot path: {allocs}");
    if allocs == 0 {
        println!("PASS: hot path is allocation-free after warm-up");
    } else {
        println!("WARN: {allocs} unexpected allocations detected");
    }
}
```
The pattern: warm up once (let pre-allocated capacity fill), reset the counter, then assert zero allocations across subsequent batches. This assertion in CI will fail the build if any refactor introduces a hidden allocation.
Key Takeaways
- Memory profiling reveals the call sites responsible for allocations, total bytes allocated per site, and peak live bytes. `dhat` (via the `dhat` crate) provides this with minimal production overhead when gated behind a feature flag.
- A counting allocator wrapper (`#[global_allocator]` with atomic counters) is the fastest way to count allocations in a specific code path. Use it to establish a baseline, then assert zero allocations in CI for hot paths.
- `HashSet::with_capacity(n)` and `Vec::with_capacity(n)` pre-allocate to avoid grow-and-rehash allocations. If you know the expected size, always use `with_capacity`.
- `clear()` retains the underlying allocation. Use it to reuse `Vec` and `HashMap` buffers across batches rather than dropping and reallocating each time.
- Common hidden allocation sources: `format!()`, `HashMap::new()` without capacity, `Vec::collect()` on iterators without an exact size hint, iterator adapters that buffer internally (itertools' `.sorted()` and `.chunks()`), and `Arc::new()` in a per-frame code path.
- Profile allocations before optimising. The counting allocator tells you how many allocations happen. The flamegraph from Lesson 2 tells you where time is spent. Together they give a complete picture: is the bottleneck the allocation count, the allocator latency, or the subsequent memory access pattern?
Project — Meridian Control Plane Performance Audit
Module: Foundation — M06: Performance & Profiling
Prerequisite: All three module quizzes passed (≥70%)
- Mission Brief
- Pipeline Under Audit
- Audit Procedure
- Expected Output
- Acceptance Criteria
- Hints
- Reference Implementation
- Reflection
Mission Brief
TO: Platform Engineering
FROM: Mission Control Systems Lead
CLASSIFICATION: UNCLASSIFIED // INTERNAL
SUBJECT: RFC-0058 — Control Plane Performance Audit and Remediation
The telemetry processor built in Module 5 achieves 100,000 frames per second in isolation. When integrated with the full control plane pipeline — ground station TCP ingress, deduplication, sort, downstream forwarding — the integrated system runs at 71,000 frames per second, 29% below target.
Your task is to conduct a structured performance audit of the integrated pipeline, identify the bottleneck using the tools from this module, implement a targeted fix with measurable improvement, and document the result.
Pipeline Under Audit
The pipeline processes frames through four stages:
[TCP Ingress] → [Validator] → [Deduplicator] → [Forwarder]
Each stage has a measurable input and output rate. Profiling tools tell you which stage is the bottleneck and which specific function within that stage consumes the most CPU.
Audit Procedure
Phase 1: Establish a Baseline with criterion
Write a criterion benchmark for the full pipeline (not just the processor). Measure:
- Frames per second through the complete pipeline
- Per-stage latency breakdown (validator, deduplicator, forwarder separately)
- Memory allocation count per batch (using a counting allocator)
The baseline establishes the starting point. Every fix must demonstrate measurable improvement against this baseline — not just "it felt faster".
Phase 2: CPU Profile with flamegraph
Run cargo flamegraph on the pipeline binary for 30 seconds under sustained load. Identify:
- Which stage occupies the most flamegraph width
- Which function within that stage is the hot leaf
- Whether the flamegraph shows malloc/free as significant contributors
Phase 3: Memory Profile with a Counting Allocator
Integrate the counting allocator from Lesson 3. For each batch of 1,000 frames:
- Count total allocations per batch
- Count allocations per stage (reset/snapshot around each stage)
- Identify which stage is responsible for the most allocations
Phase 4: Implement and Measure a Fix
Based on the profiling findings, implement the highest-impact fix. Typical candidates:
- Replace Vec::new() in the deduplicator with a reused buffer (clear() pattern)
- Replace HashMap::new() with HashMap::with_capacity(batch_size)
- Replace format!() in the validator with a pre-allocated error buffer
- Apply arena allocation for payloads that were missed in Module 5
Re-run the criterion benchmark. Document the before/after comparison.
Expected Output
A workspace with:
- A meridian-pipeline binary crate implementing the four-stage pipeline
- A benches/pipeline.rs criterion benchmark measuring the full pipeline and each stage
- An audit.md document recording:
  - Baseline criterion output (copy from terminal)
  - Flamegraph findings (which function was the hot path)
  - Allocation counts per stage per batch (from counting allocator)
  - The fix implemented
  - Post-fix criterion output showing improvement
  - criterion's statistical significance output (p-value)
Acceptance Criteria
| # | Criterion | Verifiable |
|---|---|---|
| 1 | criterion benchmark runs and produces confidence intervals for the full pipeline | Yes — cargo bench output |
| 2 | black_box applied correctly — input and output both wrapped | Yes — code review |
| 3 | Test data built outside the criterion closure, not inside | Yes — code review |
| 4 | Flamegraph generated for a ≥ 30-second profiling run | Yes — flamegraph.svg present |
| 5 | Allocation counts per stage documented in audit.md | Yes — numbers in the document |
| 6 | At least one measurable fix implemented and documented with before/after timing | Yes — audit.md |
| 7 | criterion reports p < 0.05 for the improvement (statistically significant) | Yes — criterion output in audit.md |
Hints
Hint 1 — Criterion benchmark structure
// benches/pipeline.rs
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_pipeline(c: &mut Criterion) {
    let mut group = c.benchmark_group("pipeline");

    for batch_size in [100, 500, 1000, 5000].iter() {
        let headers = build_test_headers(*batch_size);

        group.bench_with_input(
            BenchmarkId::new("full", batch_size),
            batch_size,
            |b, _| {
                b.iter(|| {
                    black_box(run_pipeline(black_box(&headers)))
                })
            },
        );
    }
    group.finish();
}

criterion_group!(benches, bench_pipeline);
criterion_main!(benches);
Hint 2 — Per-stage allocation counting
// Reset counter, run stage, snapshot:
ALLOC_COUNT.store(0, Ordering::Relaxed);
let result = run_validator(black_box(&frames));
let validator_allocs = ALLOC_COUNT.load(Ordering::Relaxed);

ALLOC_COUNT.store(0, Ordering::Relaxed);
let deduped = run_deduplicator(black_box(&result));
let dedup_allocs = ALLOC_COUNT.load(Ordering::Relaxed);

println!("validator: {validator_allocs} allocs/batch");
println!("deduplicator: {dedup_allocs} allocs/batch");
Hint 3 — Reusing buffers between batches
If the deduplicator creates a new HashSet each batch, convert it to a persistent struct:
pub struct Deduplicator {
    seen: std::collections::HashSet<(u32, u64)>,
    unique_indices: Vec<usize>,
}

impl Deduplicator {
    pub fn new(expected_batch: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(expected_batch),
            unique_indices: Vec::with_capacity(expected_batch),
        }
    }

    pub fn process(&mut self, headers: &[(u32, u64)]) -> &[usize] {
        self.seen.clear();           // Retains allocation.
        self.unique_indices.clear(); // Retains allocation.
        for (i, &key) in headers.iter().enumerate() {
            if self.seen.insert(key) {
                self.unique_indices.push(i);
            }
        }
        &self.unique_indices
    }
}
Hint 4 — Flamegraph build configuration
Add to Cargo.toml:
[profile.release]
debug = true
[profile.profiling]
inherits = "release"
debug = true
Build and profile:
cargo build --profile profiling
cargo flamegraph --profile profiling --bin meridian-pipeline -- \
--duration 30 --batch-size 1000
If cargo flamegraph is not installed: cargo install flamegraph. Requires perf on Linux or dtrace on macOS.
Reference Implementation
Reveal reference implementation
// src/main.rs — pipeline implementation for profiling
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::hint::black_box;
use std::time::Instant;

// --- Counting allocator ---
struct CountingAllocator;

static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_COUNT.fetch_add(1, Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOCATOR: CountingAllocator = CountingAllocator;

// --- Pipeline stages ---
#[inline(never)]
fn validate(headers: &[(u32, u64, u8)]) -> Vec<(u32, u64)> {
    headers.iter()
        .filter(|&&(_, _, flags)| flags & 0x80 == 0)
        .map(|&(sat, seq, _)| (sat, seq))
        .collect()
}

pub struct Deduplicator {
    seen: std::collections::HashSet<(u32, u64)>,
    indices: Vec<usize>,
}

impl Deduplicator {
    pub fn new(cap: usize) -> Self {
        Self {
            seen: std::collections::HashSet::with_capacity(cap),
            indices: Vec::with_capacity(cap),
        }
    }

    #[inline(never)]
    pub fn process(&mut self, valid: &[(u32, u64)]) -> &[usize] {
        self.seen.clear();
        self.indices.clear();
        for (i, &key) in valid.iter().enumerate() {
            if self.seen.insert(key) {
                self.indices.push(i);
            }
        }
        &self.indices
    }
}

#[inline(never)]
fn forward(valid: &[(u32, u64)], unique: &[usize]) -> usize {
    unique.iter().map(|&i| valid[i].0 as usize).sum()
}

fn run_pipeline(
    headers: &[(u32, u64, u8)],
    dedup: &mut Deduplicator,
) -> usize {
    let valid = validate(headers);
    let unique = dedup.process(&valid).to_vec();
    forward(&valid, &unique)
}

fn main() {
    let batch_size = 1_000usize;
    let headers: Vec<(u32, u64, u8)> = (0..batch_size)
        .map(|i| ((i % 48) as u32, (i / 3) as u64, 0u8))
        .collect();
    let mut dedup = Deduplicator::new(batch_size);

    // Warm up.
    for _ in 0..10 {
        run_pipeline(&headers, &mut dedup);
    }

    // Measure allocations per batch.
    ALLOC_COUNT.store(0, Relaxed);
    for _ in 0..1000 {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let allocs = ALLOC_COUNT.load(Relaxed);
    println!("allocs across 1000 batches: {allocs}");
    println!("allocs per batch: {:.1}", allocs as f64 / 1000.0);

    // Throughput measurement.
    let batches = 100_000u32;
    let start = Instant::now();
    for _ in 0..batches {
        black_box(run_pipeline(black_box(&headers), &mut dedup));
    }
    let elapsed = start.elapsed();
    let fps = (batches as usize * batch_size) as f64 / elapsed.as_secs_f64();
    println!("throughput: {:.0} frames/sec", fps);
    println!("elapsed: {:.2?}", elapsed);
}
Reflection
The audit methodology in this project — baseline, profile, identify, fix, verify — is the standard performance engineering workflow. The workflow is the skill, not the specific tools. perf and flamegraph will be replaced by better tools; the habit of measuring before and after, asserting statistical significance, and documenting findings will not.
The counting allocator CI assertion from Lesson 3 is the instrument that keeps the improvements from this module from being silently regressed six months from now. Every performance optimisation needs a regression test. For throughput, that test is a criterion baseline stored in target/criterion. For allocation-freedom, it is an assert_eq!(allocs, 0) assertion in the CI pipeline.
With Module 6 complete, the full Foundation track is done. Every capability the control plane relies on — async scheduling, concurrency primitives, message passing, networking, data layout, and performance measurement — is now in your toolkit. The track-specific modules (Database Internals, Data Pipelines, Data Lakes, Distributed Systems) build directly on this foundation.
Module 01 — Storage Engine Fundamentals
Track: Database Internals — Orbital Object Registry
Position: Module 1 of 6
Source material: Database Internals — Alex Petrov, Chapters 1–4; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0041
Classification: OPERATIONAL DEFICIENCY
Subject: TLE index query latency exceeding conjunction avoidance SLA
ESA's Space Surveillance and Tracking (SST) division has notified Meridian Space Systems that our Two-Line Element (TLE) index cannot scale past 100,000 tracked orbital objects. The current architecture stores TLE records as serialized JSON blobs in PostgreSQL — every conjunction query triggers a full table scan. With the post-fragmentation debris field projected to add 12,000 new objects this quarter, the system will exceed the 500ms conjunction query SLA within 60 days.
Directive: Build a purpose-built storage engine for the Orbital Object Registry. Start with the lowest layer — how bytes hit disk and come back.
This module establishes the foundational layer of the Orbital Object Registry storage engine. Before you can index, query, or recover data, you need a reliable on-disk format and an efficient way to move pages between disk and memory. Every decision made here — page size, record layout, eviction policy — propagates upward through the entire engine.
Learning Outcomes
After completing this module, you will be able to:
- Design a fixed-size page format with headers, magic bytes, and checksums for integrity verification
- Implement a buffer pool that caches hot pages in memory and evicts cold pages using LRU or CLOCK policies
- Explain why random I/O is the dominant cost in storage engines and how page-aligned access patterns reduce it
- Implement a slotted page layout that supports variable-length records with in-page compaction
- Reason about the tradeoffs between page size, I/O amplification, and internal fragmentation
- Map TLE records to a binary page format suitable for the Orbital Object Registry
Lesson Summary
Lesson 1 — File Formats and Page Layout
How storage engines organize bytes on disk. Fixed-size pages, headers, magic bytes, and the page abstraction that separates logical records from physical storage. Why 4KB or 8KB pages align with OS and hardware boundaries.
Key question: Why do storage engines use fixed-size pages instead of variable-length records written sequentially?
Lesson 2 — Buffer Pool Management
The page cache that sits between the storage engine and the OS. LRU and CLOCK eviction policies, page pinning, dirty page tracking, and the flush protocol. Why the buffer pool exists even though the OS has its own page cache.
Key question: When should a storage engine bypass the OS page cache and manage its own buffer pool?
Lesson 3 — Slotted Pages
How to store variable-length records within a fixed-size page. The slot array, free space pointer, and in-page compaction. How deletions create fragmentation and how the engine reclaims space without rewriting the entire page.
Key question: How does a slotted page maintain stable record identifiers when records are moved during compaction?
Capstone Project — TLE Record Page Manager
Build a page manager that reads and writes orbital TLE records to a custom binary page format backed by a simple buffer pool. The page manager must support insert, lookup by slot, delete, and page-level compaction. Acceptance criteria and the full project brief are in project-tle-page-manager.md.
File Index
module-01-storage-engine-fundamentals/
├── README.md ← this file
├── lesson-01-page-layout.md ← File formats and page layout
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-buffer-pool.md ← Buffer pool management
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-slotted-pages.md ← Slotted pages
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-tle-page-manager.md ← Capstone project brief
Prerequisites
- Foundation Track completed (all 6 modules)
- Familiarity with std::fs::File and the Read, Write, and Seek traits
- Basic understanding of how operating systems manage file I/O
What Comes Next
Module 2 (B-Tree Index Structures) builds on the page abstraction from this module. The B-tree nodes you implement in Module 2 are stored in the pages you design here. The buffer pool you build here is the same buffer pool that serves page requests for the B-tree and, later, the LSM engine.
Lesson 1 — File Formats and Page Layout
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 1–3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: exact page header field sizes in Petrov Ch. 3, the magic byte conventions across SQLite/InnoDB/RocksDB, and Petrov's specific framing of the page abstraction layer.
Context
Every storage engine eventually answers the same question: how do bytes get from memory to disk and back? The answer is almost never "write them sequentially and hope for the best." Sequential writes are fast, but random reads against an unstructured file are catastrophic — seeking to an arbitrary byte offset in a 10GB file on a spinning disk costs 5–10ms per seek. Even on SSDs, random 512-byte reads are an order of magnitude slower than reading aligned 4KB blocks.
The solution, used by virtually every production storage engine from SQLite to RocksDB to PostgreSQL, is the page abstraction: divide the file into fixed-size blocks (pages), give each page a numeric identifier, and build all higher-level structures — indices, records, free lists — on top of this uniform unit. Pages align with the OS virtual memory system (typically 4KB) and the storage device's block size, which means reads and writes hit the hardware at its natural granularity.
For the Orbital Object Registry, each TLE record is approximately 140 bytes (two 69-character lines plus metadata). A single 4KB page can hold roughly 25–28 TLE records. With 100,000 tracked objects, the entire catalog fits in approximately 4,000 pages — about 16MB. The page format you design in this lesson is the physical foundation that every subsequent module builds on.
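The sizing claims above can be sanity-checked with quick arithmetic. A sketch using this lesson's figures (17-byte header, ~140-byte record); it ignores per-record slot overhead (Lesson 3's slot array), which is what brings the practical count down toward the 25–28 quoted:

```rust
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17;
const TLE_RECORD_SIZE: usize = 140; // two 69-char lines plus metadata

/// Records that fit in one page after the header (no slot overhead).
fn records_per_page() -> usize {
    (PAGE_SIZE - HEADER_SIZE) / TLE_RECORD_SIZE
}

/// Pages needed to hold `n` records, rounding up.
fn pages_for(n: usize) -> usize {
    let rpp = records_per_page();
    (n + rpp - 1) / rpp
}
```

Even at the conservative 25 records per page, 100,000 objects fit in 4,000 pages, i.e. about 16MB of 4KB pages.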
Core Concepts
The Page Abstraction
A page is a fixed-size block of bytes — the atomic unit of I/O in a storage engine. The engine never reads or writes less than one page. This constraint seems wasteful (reading 4KB to retrieve a 140-byte TLE record), but it aligns with how hardware and operating systems actually work:
- Disk drives read and write in sectors (512 bytes or 4KB for modern drives). Reading 1 byte costs the same as reading 4KB — the drive fetches the entire sector regardless.
- The OS page cache manages memory in 4KB pages. A storage engine that uses the same page size gets free alignment with the kernel's caching layer.
- mmap and direct I/O both operate on page-aligned boundaries. Misaligned reads require the kernel to fetch extra pages and copy the relevant bytes — an unnecessary overhead.
Common page sizes: 4KB (SQLite, default PostgreSQL), 8KB (PostgreSQL configurable, InnoDB), 16KB (InnoDB default), 64KB (some OLAP systems). Larger pages reduce the number of I/O operations for sequential scans but increase I/O amplification for point lookups (you read 64KB to get 140 bytes). The Orbital Object Registry uses 4KB pages — the catalog is small enough that point lookup amplification matters more than scan throughput.
Page Layout
Every page begins with a header that identifies the page and describes its contents. The header is the first thing the engine reads after loading a page from disk, and it must contain enough information to interpret the rest of the page without external context.
A minimal page header contains:
| Field | Size | Purpose |
|---|---|---|
| Magic bytes | 4 bytes | Identifies this as a valid OOR page (e.g., 0x4F4F5231 = "OOR1") |
| Page ID | 4 bytes | Unique identifier for this page within the file |
| Page type | 1 byte | Discriminant: data page, index page, overflow page, free page |
| Record count | 2 bytes | Number of active records in this page |
| Free space offset | 2 bytes | Byte offset where free space begins |
| Checksum | 4 bytes | CRC32 of the page body for integrity verification |
Total header: 17 bytes. The remaining 4,079 bytes (in a 4KB page) are available for records.
Magic bytes serve two purposes: they let the engine detect corrupted or misidentified files on open (if the first 4 bytes of the file aren't OOR1, this isn't an OOR database), and they enable file-level identification by external tools (file command, hex editors). Production systems like SQLite use SQLite format 3\000 as the first 16 bytes of the file header.
Checksums detect bit rot and partial writes. A page whose checksum doesn't match its body was either corrupted on disk or partially written during a crash. The engine must reject it and attempt recovery from the WAL (Module 4). CRC32 is standard; some engines use xxHash for speed or SHA-256 for cryptographic integrity.
File Organization
The database file is a contiguous sequence of pages. Page 0 is typically a file header page that stores metadata: database version, page size, total page count, pointer to the free list head, and engine configuration. Pages 1 through N hold data.
┌──────────┬──────────┬──────────┬──────────┬─────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ ... │
│ (header) │ (data) │ (data) │ (free) │ │
└──────────┴──────────┴──────────┴──────────┴─────┘
↑
File header: version, page size, page count,
free list head → Page 3
Addressing: Given a page ID and the page size, the byte offset in the file is page_id * page_size. This arithmetic is the reason pages must be fixed-size — variable-size pages would require a separate index to locate each page, adding a layer of indirection to every I/O operation.
Free list management: When a page is deallocated (all records deleted, or a B-tree node is merged), it goes on the free list rather than being returned to the OS. The next allocation request takes a page from the free list before extending the file. This avoids filesystem fragmentation and keeps the file size stable under delete-heavy workloads.
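The free-list-first allocation policy can be sketched as follows. The PageAllocator name and the in-memory Vec free list are illustrative — a real engine persists the list head in the file header page (page 0):

```rust
/// Page allocator that prefers the free list over growing the file.
/// Sketch only: the free list lives in memory here; a production
/// engine persists it so freed pages survive restarts.
struct PageAllocator {
    free_list: Vec<u32>, // IDs of deallocated pages
    next_page_id: u32,   // next ID past the current end of the file
}

impl PageAllocator {
    /// Reuse a freed page before extending the file — this keeps the
    /// file size stable under delete-heavy workloads.
    fn allocate(&mut self) -> u32 {
        self.free_list.pop().unwrap_or_else(|| {
            let id = self.next_page_id;
            self.next_page_id += 1;
            id
        })
    }

    /// A deallocated page goes on the free list, not back to the OS.
    fn deallocate(&mut self, page_id: u32) {
        self.free_list.push(page_id);
    }
}
```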
Alignment and Direct I/O
When a storage engine bypasses the OS page cache (using O_DIRECT on Linux), all reads and writes must be aligned to the device's block size — typically 512 bytes or 4KB. Misaligned I/O fails with EINVAL. Even when not using direct I/O, aligned access avoids read-modify-write cycles in the kernel's page cache.
In Rust, allocating page-aligned buffers requires care. Vec<u8> does not guarantee alignment beyond the default allocator's alignment (typically 8 or 16 bytes). For direct I/O, you need explicit alignment:
use std::alloc::{alloc, dealloc, Layout};
/// Allocate a page-aligned buffer for direct I/O.
/// Safety: caller must ensure `page_size` is a power of two.
fn alloc_aligned_page(page_size: usize) -> *mut u8 {
let layout = Layout::from_size_align(page_size, page_size)
.expect("page_size must be a power of two");
// Safety: layout is valid (non-zero size, power-of-two alignment)
unsafe { alloc(layout) }
}
Production engines wrap this in a PageBuf type that handles allocation, deallocation, and provides safe access to the underlying bytes.
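A minimal sketch of such a wrapper — the PageBuf name and methods are illustrative, not a specific engine's API:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Owned, page-aligned buffer. Frees its allocation on drop.
struct PageBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl PageBuf {
    /// `page_size` must be a power of two (Layout enforces this).
    fn new(page_size: usize) -> Self {
        let layout = Layout::from_size_align(page_size, page_size)
            .expect("page_size must be a power of two");
        // Safety: layout has non-zero size and power-of-two alignment.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, layout }
    }

    fn as_slice(&self) -> &[u8] {
        // Safety: ptr is valid for layout.size() bytes while self lives.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // Safety: &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for PageBuf {
    fn drop(&mut self) {
        // Safety: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

The Drop implementation is what makes this safe to use: the aligned allocation can never leak, no matter how the buffer goes out of scope.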
Code Examples
Defining the Page Format for Orbital TLE Records
The Orbital Object Registry needs a page format that can store TLE records with their associated metadata. This example defines the page header, serialization, and deserialization logic.
use std::io::{self, Read, Write, Seek, SeekFrom};
use std::fs::File;
const PAGE_SIZE: usize = 4096;
const MAGIC: [u8; 4] = [0x4F, 0x4F, 0x52, 0x31]; // "OOR1"
const HEADER_SIZE: usize = 17;
/// Page types in the Orbital Object Registry.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageType {
FileHeader = 0,
Data = 1,
Index = 2,
Free = 3,
Overflow = 4,
}
/// Fixed-size page header. Sits at byte 0 of every page.
#[derive(Debug)]
struct PageHeader {
magic: [u8; 4],
page_id: u32,
page_type: PageType,
record_count: u16,
free_space_offset: u16,
checksum: u32,
}
impl PageHeader {
fn new(page_id: u32, page_type: PageType) -> Self {
Self {
magic: MAGIC,
page_id,
page_type,
record_count: 0,
// Free space starts immediately after the header
free_space_offset: HEADER_SIZE as u16,
checksum: 0,
}
}
fn serialize(&self, buf: &mut [u8]) {
buf[0..4].copy_from_slice(&self.magic);
buf[4..8].copy_from_slice(&self.page_id.to_le_bytes());
buf[8] = self.page_type as u8;
buf[9..11].copy_from_slice(&self.record_count.to_le_bytes());
buf[11..13].copy_from_slice(&self.free_space_offset.to_le_bytes());
buf[13..17].copy_from_slice(&self.checksum.to_le_bytes());
}
fn deserialize(buf: &[u8]) -> io::Result<Self> {
if buf[0..4] != MAGIC {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"invalid page magic bytes — not an OOR page",
));
}
Ok(Self {
magic: MAGIC,
page_id: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
page_type: match buf[8] {
0 => PageType::FileHeader,
1 => PageType::Data,
2 => PageType::Index,
3 => PageType::Free,
4 => PageType::Overflow,
_ => return Err(io::Error::new(
io::ErrorKind::InvalidData,
"unknown page type discriminant",
)),
},
record_count: u16::from_le_bytes(buf[9..11].try_into().unwrap()),
free_space_offset: u16::from_le_bytes(buf[11..13].try_into().unwrap()),
checksum: u32::from_le_bytes(buf[13..17].try_into().unwrap()),
})
}
}
Notice that all multi-byte integers use little-endian encoding (to_le_bytes/from_le_bytes). This is a deliberate choice — the engine should produce the same on-disk format regardless of the host architecture. Big-endian is equally valid (and simplifies key comparison in B-trees, as we'll see in Module 2), but you must pick one and enforce it everywhere. Mixing endianness across page types is a subtle bug that survives unit tests and explodes in production.
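The key-comparison point is easy to verify: big-endian byte strings compare lexicographically in the same order as the integers they encode, while little-endian byte strings do not.

```rust
/// Returns true iff big-endian byte order preserves numeric ordering
/// for this pair while little-endian does not.
fn demo() -> bool {
    let (a, b) = (0x0000_0100u32, 0x0000_00FFu32); // a > b numerically
    // Big-endian byte strings compare like the integers...
    let be_ordered = a.to_be_bytes() > b.to_be_bytes();
    // ...little-endian ones do not: b's first byte is 0xFF, a's is 0x00.
    let le_ordered = a.to_le_bytes() > b.to_le_bytes();
    be_ordered && !le_ordered
}
```

This is why big-endian keys let a B-tree compare raw bytes with memcmp instead of deserializing first.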
Reading and Writing Pages to Disk
The page I/O layer translates between page IDs and file offsets. This is the lowest layer of the storage engine — everything above it thinks in pages, not bytes.
/// Low-level page I/O against the database file.
struct PageFile {
file: File,
page_size: usize,
}
impl PageFile {
fn open(path: &str, page_size: usize) -> io::Result<Self> {
let file = File::options()
.read(true)
.write(true)
.create(true)
.open(path)?;
Ok(Self { file, page_size })
}
/// Read a page from disk into the provided buffer.
/// The buffer must be exactly `page_size` bytes.
fn read_page(&mut self, page_id: u32, buf: &mut [u8]) -> io::Result<()> {
assert_eq!(buf.len(), self.page_size);
let offset = page_id as u64 * self.page_size as u64;
self.file.seek(SeekFrom::Start(offset))?;
self.file.read_exact(buf)?;
Ok(())
}
/// Write a page buffer to disk at the correct offset.
fn write_page(&mut self, page_id: u32, buf: &[u8]) -> io::Result<()> {
assert_eq!(buf.len(), self.page_size);
let offset = page_id as u64 * self.page_size as u64;
self.file.seek(SeekFrom::Start(offset))?;
self.file.write_all(buf)?;
// Note: we do NOT fsync here. Durability is the WAL's job (Module 4).
// Calling fsync on every page write would destroy throughput —
// a single fsync costs 1-10ms on SSD, 10-30ms on spinning disk.
Ok(())
}
/// Allocate a new page at the end of the file. Returns the new page ID.
fn allocate_page(&mut self) -> io::Result<u32> {
let file_len = self.file.seek(SeekFrom::End(0))?;
let page_id = (file_len / self.page_size as u64) as u32;
let zeroed = vec![0u8; self.page_size];
self.file.write_all(&zeroed)?;
Ok(page_id)
}
}
Two things to notice: first, read_page uses read_exact, not read. A short read (fewer bytes than page_size) means the file is truncated or corrupted — the engine must not silently accept a partial page. Second, write_page does not call fsync. This is intentional. The WAL (Module 4) provides durability guarantees; the page file relies on the WAL for crash recovery. Calling fsync on every page write would reduce throughput from thousands of pages/second to fewer than 100 on spinning disk.
Computing and Verifying Page Checksums
Every page is checksummed before being written to disk. On read, the checksum is verified before the page contents are trusted. This catches bit rot, partial writes, and storage firmware bugs.
/// CRC32 checksum of the page body (everything after the checksum field).
/// The checksum field itself is excluded from the computation, so the
/// result is deterministic regardless of the previous checksum value.
fn compute_checksum(page_buf: &[u8]) -> u32 {
    // Checksum covers bytes 17..PAGE_SIZE (the body).
    // The header's checksum field (bytes 13..17) is excluded from the
    // computation — it stores the result.
    let body = &page_buf[HEADER_SIZE..];
    crc32fast::hash(body)
}
fn write_page_with_checksum(
page_file: &mut PageFile,
page_id: u32,
buf: &mut [u8],
) -> io::Result<()> {
let checksum = compute_checksum(buf);
buf[13..17].copy_from_slice(&checksum.to_le_bytes());
page_file.write_page(page_id, buf)
}
fn read_and_verify_page(
page_file: &mut PageFile,
page_id: u32,
buf: &mut [u8],
) -> io::Result<()> {
page_file.read_page(page_id, buf)?;
let stored = u32::from_le_bytes(buf[13..17].try_into().unwrap());
let computed = compute_checksum(buf);
if stored != computed {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"page {} checksum mismatch: stored={:#010x}, computed={:#010x}",
page_id, stored, computed
),
));
}
Ok(())
}
The checksum covers only the page body, not the header's checksum field itself. This avoids a circular dependency: you can't include the checksum in the data being checksummed. Some engines (e.g., PostgreSQL) checksum the entire page with the checksum field zeroed before computation — both approaches work, but you must document which one you use.
Key Takeaways
- The page is the atomic unit of I/O in a storage engine. All reads and writes operate on full pages, never partial pages. This aligns with hardware block sizes and OS page cache granularity.
- Page size is a tradeoff: larger pages reduce I/O count for scans but increase amplification for point lookups. 4KB is the default for OLTP-style workloads like the Orbital Object Registry.
- Every page starts with a header containing magic bytes, page ID, type discriminant, and a checksum. The header must be self-describing — the engine should be able to interpret any page without external context.
- Byte order must be fixed across the entire on-disk format. Pick little-endian or big-endian and enforce it everywhere. Never rely on native endianness.
- Page writes do not call fsync. Durability is provided by the write-ahead log, not by synchronous page flushes. This is a fundamental architectural decision that separates high-throughput engines from naive implementations.
Lesson 2 — Buffer Pool Management
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 5; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific CLOCK algorithm variant, his framing of the buffer pool vs. OS page cache tradeoff, and the exact dirty page flush protocols described in Chapter 5.
Context
The page I/O layer from Lesson 1 reads and writes pages directly to disk. Every read_page call triggers a system call, a disk seek (on HDD) or a flash translation layer lookup (on SSD), and a DMA transfer. For the Orbital Object Registry's conjunction query workload — which repeatedly accesses the same hot set of TLE records during a pass window — going to disk for every page request is unacceptable. A conjunction check against 100 objects would issue 100+ page reads, taking 50–100ms on SSD and over a second on spinning disk.
The buffer pool solves this by caching recently-accessed pages in memory. It sits between the storage engine's upper layers (B-tree, LSM, query processor) and the page I/O layer. When a page is requested, the buffer pool checks whether it's already in memory. If so, it returns a pointer to the cached copy — no disk I/O required. If not, it evicts a cold page to make room, reads the requested page from disk, caches it, and returns it. For a well-tuned buffer pool with a hot working set that fits in memory, the hit rate exceeds 99%, and the storage engine operates almost entirely from RAM.
The buffer pool exists even though the OS already has a page cache. The difference is control: the OS page cache uses a generic LRU policy that doesn't know about access patterns specific to the storage engine (sequential scan flooding, index traversal locality). A purpose-built buffer pool can use workload-aware eviction, pin pages during multi-step operations, and track dirty pages for coordinated flushing.
Core Concepts
Buffer Pool Architecture
The buffer pool is a fixed-size array of frames — each frame holds one page-sized buffer plus metadata. The metadata tracks:
- Page ID — which on-disk page is currently loaded in this frame
- Pin count — how many active references exist to this frame. A pinned page cannot be evicted.
- Dirty flag — whether the page has been modified since it was loaded from disk. Dirty pages must be written back before eviction.
- Reference bit (for CLOCK) — whether the page has been accessed recently
The buffer pool also maintains a page table — a hash map from page ID to frame index — for O(1) lookups. When the engine requests page 42, the buffer pool checks the page table. If page 42 maps to frame 7, the engine gets a reference to frame 7's buffer. If page 42 is not in the page table, the buffer pool must evict a page and load 42 from disk.
Page Table (HashMap<PageId, FrameId>)
┌─────────┬─────────┐
│ Page 42 │ Frame 7 │
│ Page 13 │ Frame 2 │
│ Page 99 │ Frame 0 │
│ ... │ ... │
└─────────┴─────────┘
Frame Array
┌─────────┬─────────┬─────────┬─────────┐
│ Frame 0 │ Frame 1 │ Frame 2 │ Frame 3 │ ...
│ pg=99 │ (empty) │ pg=13 │ (empty) │
│ pin=1 │ │ pin=0 │ │
│ dirty=N │ │ dirty=Y │ │
└─────────┴─────────┴─────────┴─────────┘
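A sketch of the hit path through these structures (names illustrative; the miss path — pick a victim, flush if dirty, read from disk — depends on the eviction policies discussed next):

```rust
use std::collections::HashMap;

struct Frame {
    page_id: Option<u32>, // which on-disk page this frame holds
    pin_count: u32,       // active references; evictable only at 0
    dirty: bool,          // modified since load; must flush before reuse
}

struct BufferPool {
    frames: Vec<Frame>,
    page_table: HashMap<u32, usize>, // page ID → frame index
}

impl BufferPool {
    /// O(1) lookup through the page table; pins the frame on a hit.
    /// Returns None on a miss (eviction + disk read not shown here).
    fn fetch(&mut self, page_id: u32) -> Option<usize> {
        let &frame_idx = self.page_table.get(&page_id)?;
        self.frames[frame_idx].pin_count += 1;
        Some(frame_idx)
    }

    /// Release a reference; record whether the caller modified the page.
    fn unpin(&mut self, frame_idx: usize, dirtied: bool) {
        let frame = &mut self.frames[frame_idx];
        frame.pin_count -= 1;
        frame.dirty |= dirtied;
    }
}
```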
Eviction Policies: LRU
Least Recently Used (LRU) evicts the page that hasn't been accessed for the longest time. The intuition: if a page hasn't been needed recently, it's unlikely to be needed soon. LRU is implemented with a doubly-linked list — on every access, the page is moved to the head of the list. On eviction, the tail page is removed.
LRU's weakness is scan flooding: a single sequential scan over the entire database evicts every hot page from the buffer pool, even if those pages are accessed hundreds of times per second by other queries. After the scan completes, every subsequent request misses the buffer pool and goes to disk. This is catastrophic for the OOR — a full catalog export scan would evict the conjunction query hot set.
Mitigations: LRU-K (track the K-th most recent access, not just the most recent), 2Q (separate queues for first-access and re-access pages), or ARC (adaptive replacement cache). PostgreSQL uses a clock-sweep approximation. MySQL/InnoDB uses a two-list LRU with a "young" and "old" sublist.
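A timestamp-based sketch makes the basic LRU policy concrete — O(1) bookkeeping per access but an O(n) victim scan; production pools use an intrusive linked list instead:

```rust
use std::collections::HashMap;

/// LRU via logical timestamps (a sketch, not the linked-list form).
struct LruTracker {
    clock: u64,
    last_access: HashMap<u32, u64>, // page ID → time of last access
}

impl LruTracker {
    fn new() -> Self {
        Self { clock: 0, last_access: HashMap::new() }
    }

    /// Record an access: the page becomes the most recently used.
    fn touch(&mut self, page_id: u32) {
        self.clock += 1;
        self.last_access.insert(page_id, self.clock);
    }

    /// The page with the oldest access time is the eviction victim.
    fn victim(&self) -> Option<u32> {
        self.last_access
            .iter()
            .min_by_key(|(_, &t)| t)
            .map(|(&p, _)| p)
    }
}
```

Note how a sequential scan would call touch on every page exactly once, pushing the genuinely hot pages toward the victim end — the scan-flooding failure described above.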
Eviction Policies: CLOCK
CLOCK approximates LRU without the overhead of maintaining a linked list. Each frame has a single reference bit. When a page is accessed, its reference bit is set to 1. When the buffer pool needs to evict, it sweeps through the frames in circular order (like a clock hand):
- If the current frame's reference bit is 1, clear it to 0 and advance the hand.
- If the current frame's reference bit is 0 and the page is not pinned and not dirty (or if dirty, flush it first), evict this page.
CLOCK is cheaper per operation than LRU (no linked list manipulation on every access — just set a bit) and provides comparable hit rates for most workloads. It's the default in many systems.
The weakness is the same as LRU: a full scan sets every reference bit to 1, requiring the clock hand to sweep the entire pool before any page can be evicted. CLOCK-sweep with a scan-resistant enhancement (used by PostgreSQL) mitigates this by not setting the reference bit for pages accessed during a sequential scan.
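The sweep described above can be sketched in a few lines. This is a minimal illustration, not the lesson's `BufferPool`: `ClockFrame`, `ClockPool`, and `find_victim` are hypothetical names, and dirty-page flushing is omitted.

```rust
/// Illustrative frame metadata for a CLOCK pool sketch.
struct ClockFrame {
    page_id: Option<u32>,
    ref_bit: bool,   // set to true on every access
    pin_count: u32,  // pinned frames are never evicted
}

struct ClockPool {
    frames: Vec<ClockFrame>,
    hand: usize, // current clock-hand position
}

impl ClockPool {
    /// Sweep until an evictable frame is found.
    /// Returns None if every frame is pinned.
    fn find_victim(&mut self) -> Option<usize> {
        // At most two full sweeps: the first pass clears reference
        // bits, so the second pass must find a victim unless every
        // frame is pinned.
        for _ in 0..(2 * self.frames.len()) {
            let idx = self.hand;
            self.hand = (self.hand + 1) % self.frames.len();
            let frame = &mut self.frames[idx];
            if frame.pin_count > 0 {
                continue; // pinned: skip without touching the bit
            }
            if frame.ref_bit {
                frame.ref_bit = false; // second chance: clear and advance
            } else {
                return Some(idx); // unreferenced and unpinned: evict
            }
        }
        None // all frames pinned
    }
}
```

Note that a pinned frame is skipped without clearing its reference bit, so a briefly pinned hot page is not penalized by the sweep.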
Page Pinning
A page is pinned when the engine is actively using it and it must not be evicted. The pin count tracks how many concurrent users hold a reference to the page. A page is evictable only when pin_count == 0.
Pinning is critical for correctness: if the engine is in the middle of reading a B-tree node and the buffer pool evicts that page, the engine reads garbage. The protocol:
- Fetch a page: buffer pool loads or finds it, increments pin count, returns a handle.
- Use the page: engine reads or writes the page data.
- Unpin the page: engine decrements pin count when done. If the engine modified the page, it marks it dirty.
Failing to unpin a page is a resource leak — the page can never be evicted, and eventually the buffer pool fills with pinned pages and all fetch requests fail. In Rust, RAII handles this naturally: the page handle decrements the pin count in its Drop implementation.
Dirty Page Flushing
A dirty page has been modified in memory but not yet written back to disk. The buffer pool tracks dirty pages and flushes them in two scenarios:
- Eviction flush: when a dirty page is selected for eviction, it must be written to disk before the frame can be reused.
- Background flush: a periodic background thread scans for dirty pages and writes them to disk proactively, reducing the chance that an eviction will stall on a synchronous write.
The buffer pool does not call fsync after every flush. Durability is the WAL's responsibility (Module 4). The buffer pool's flush is an optimization — it keeps the page file reasonably up-to-date so that crash recovery doesn't have to replay the entire WAL.
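A background flush pass is just a scan that writes out dirty frames and clears their dirty bits. A minimal sketch, with illustrative names (`FlushFrame`, `background_flush`); a real pool would write through its `PageFile` rather than collect page IDs:

```rust
/// Illustrative frame type; the lesson's real Frame also carries
/// the page data and a pin count.
struct FlushFrame {
    page_id: Option<u32>,
    is_dirty: bool,
}

/// One background flush pass: write every dirty page and clear its
/// dirty bit. Here the "write" is recorded in a Vec; a real pool
/// would call page_file.write_page(page_id, &frame.data) instead.
/// No fsync is issued: durability remains the WAL's responsibility.
fn background_flush(frames: &mut [FlushFrame]) -> Vec<u32> {
    let mut flushed = Vec::new();
    for frame in frames.iter_mut() {
        if let (Some(page_id), true) = (frame.page_id, frame.is_dirty) {
            flushed.push(page_id); // stand-in for the disk write
            frame.is_dirty = false; // page is now clean in memory
        }
    }
    flushed
}
```

After a pass like this, a later eviction of any flushed page can reuse the frame immediately instead of stalling on a synchronous write.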
Code Examples
A Simple LRU Buffer Pool for the Orbital Object Registry
This buffer pool caches OOR pages in memory and evicts the least recently used unpinned page when the pool is full.
use std::collections::{HashMap, VecDeque};
use std::io;
const PAGE_SIZE: usize = 4096;
/// Metadata for a single buffer pool frame.
struct Frame {
page_id: Option<u32>,
data: [u8; PAGE_SIZE],
pin_count: u32,
is_dirty: bool,
}
impl Frame {
fn new() -> Self {
Self {
page_id: None,
data: [0u8; PAGE_SIZE],
pin_count: 0,
is_dirty: false,
}
}
}
/// LRU buffer pool backed by the OOR page file.
struct BufferPool {
frames: Vec<Frame>,
page_table: HashMap<u32, usize>, // page_id -> frame_index
/// LRU order: front = most recently used, back = least recently used.
/// Contains frame indices. Only unpinned frames participate in LRU.
lru_list: VecDeque<usize>,
page_file: PageFile, // From Lesson 1
}
impl BufferPool {
fn new(num_frames: usize, page_file: PageFile) -> Self {
let mut frames = Vec::with_capacity(num_frames);
let mut lru_list = VecDeque::with_capacity(num_frames);
for i in 0..num_frames {
frames.push(Frame::new());
lru_list.push_back(i); // All frames start as free (evictable)
}
Self {
frames,
page_table: HashMap::new(),
lru_list,
page_file,
}
}
/// Fetch a page into the buffer pool. Returns the frame index.
/// The page is pinned — caller MUST call `unpin` when done.
fn fetch_page(&mut self, page_id: u32) -> io::Result<usize> {
// Fast path: page is already in the pool
if let Some(&frame_idx) = self.page_table.get(&page_id) {
self.frames[frame_idx].pin_count += 1;
self.move_to_front(frame_idx);
return Ok(frame_idx);
}
// Slow path: need to load from disk. Find an evictable frame.
let frame_idx = self.find_evict_target()?;
// If the frame holds a dirty page, flush it before reuse
if let Some(old_page_id) = self.frames[frame_idx].page_id {
if self.frames[frame_idx].is_dirty {
self.page_file.write_page(
old_page_id,
&self.frames[frame_idx].data,
)?;
}
self.page_table.remove(&old_page_id);
}
// Load the requested page into the frame
self.page_file.read_page(page_id, &mut self.frames[frame_idx].data)?;
self.frames[frame_idx].page_id = Some(page_id);
self.frames[frame_idx].pin_count = 1;
self.frames[frame_idx].is_dirty = false;
self.page_table.insert(page_id, frame_idx);
self.move_to_front(frame_idx);
Ok(frame_idx)
}
/// Unpin a page. Caller must indicate whether the page was modified.
fn unpin(&mut self, frame_idx: usize, is_dirty: bool) {
let frame = &mut self.frames[frame_idx];
assert!(frame.pin_count > 0, "unpin called on unpinned frame");
frame.pin_count -= 1;
if is_dirty {
frame.is_dirty = true;
}
}
/// Find the least recently used unpinned frame.
fn find_evict_target(&self) -> io::Result<usize> {
// Scan from the back (LRU end) for an unpinned frame
for &frame_idx in self.lru_list.iter().rev() {
if self.frames[frame_idx].pin_count == 0 {
return Ok(frame_idx);
}
}
Err(io::Error::new(
io::ErrorKind::Other,
"buffer pool exhausted: all frames are pinned",
))
}
/// Move a frame to the front of the LRU list (most recently used).
fn move_to_front(&mut self, frame_idx: usize) {
self.lru_list.retain(|&idx| idx != frame_idx);
self.lru_list.push_front(frame_idx);
}
}
The move_to_front implementation is O(n) because VecDeque::retain scans the entire list. A production buffer pool uses an intrusive doubly-linked list for O(1) LRU updates — Rust crates like intrusive-collections provide this. The O(n) approach is correct and sufficient for understanding the algorithm; the optimization matters only when the buffer pool has thousands of frames and fetch rates exceed 100k/sec.
Notice the pin-count assert in unpin: a double-unpin is a logic bug that must crash immediately in development. In production, this would be debug_assert! to avoid panicking on a user-facing code path.
RAII Page Handle for Automatic Unpinning
Rust's ownership system prevents the "forgot to unpin" bug class entirely. A page handle unpins automatically when it goes out of scope.
/// RAII handle to a pinned buffer pool page.
/// Automatically unpins the page when dropped.
struct PageHandle<'a> {
pool: &'a mut BufferPool,
frame_idx: usize,
dirty: bool,
}
impl<'a> PageHandle<'a> {
fn data(&self) -> &[u8; PAGE_SIZE] {
&self.pool.frames[self.frame_idx].data
}
fn data_mut(&mut self) -> &mut [u8; PAGE_SIZE] {
self.dirty = true;
&mut self.pool.frames[self.frame_idx].data
}
}
impl<'a> Drop for PageHandle<'a> {
fn drop(&mut self) {
self.pool.unpin(self.frame_idx, self.dirty);
}
}
This is one of the places where Rust's borrow checker provides a genuine advantage over C/C++ buffer pool implementations. In C, every code path that fetches a page must remember to unpin it — including error paths, early returns, and exception-like longjmp flows. In Rust, the Drop implementation runs unconditionally when the handle leaves scope. The borrow checker also prevents holding a &mut reference to the page data after the handle is dropped, which would alias freed memory in C.
The tradeoff: the &'a mut BufferPool borrow means you can only hold one PageHandle at a time with this design. A production buffer pool uses Arc<Mutex<...>> or unsafe interior mutability to allow multiple concurrent page handles — we'll revisit this pattern when we implement B-tree traversal in Module 2.
Key Takeaways
- The buffer pool is a fixed-size array of page-sized frames with a hash map for O(1) page-to-frame lookup. It eliminates disk I/O for hot pages and is the single largest performance lever in any storage engine.
- LRU eviction is simple but vulnerable to scan flooding. CLOCK approximates LRU at lower cost per operation. Production engines use hybrid policies (LRU-K, 2Q, ARC) to resist scan-induced cache pollution.
- Page pinning prevents eviction during active use. In Rust, RAII handles make pin leaks impossible — the Drop implementation guarantees unpinning on all code paths, including panics.
- Dirty pages are flushed on eviction and by background threads. The buffer pool does not call fsync — durability is the WAL's job.
- The "all frames pinned" error means the buffer pool is undersized for the workload's concurrency level. In the OOR, this can happen during peak conjunction checking if every active query holds a page pin simultaneously.
Lesson 3 — Slotted Pages
Module: Database Internals — M01: Storage Engine Fundamentals
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 3; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific slotted page layout, his terminology for the slot directory vs. cell pointer array, and the compaction algorithm described in Chapter 3.
Context
The page format from Lesson 1 stores records at fixed offsets. This works for fixed-size records — and if every TLE record were exactly 140 bytes, that would be sufficient. But real TLE data is messier: newer objects have additional metadata fields (drag coefficients, maneuver flags, covariance matrices), older legacy records omit optional fields, and record sizes will grow as ESA adds new conjunction assessment data. A format that requires all records to be the same size either wastes space (padding every record to the maximum) or breaks when schema evolves.
The slotted page layout solves this by decoupling record identity from record position. Records are addressed by slot numbers, and a slot directory at the beginning of the page maps each slot to the record's actual byte offset and length within the page. Records grow from the end of the page toward the front, while the slot directory grows from the front toward the end. They meet in the middle — and when they collide, the page is full.
This layout is the standard for every major relational database (PostgreSQL, MySQL/InnoDB, SQLite) and many key-value stores. Understanding it is prerequisite for everything that follows: B-tree nodes (Module 2) are slotted pages, WAL log records reference slot IDs (Module 4), and MVCC version chains (Module 5) track records by their page-and-slot address.
Core Concepts
Slotted Page Layout
A slotted page has three regions:
┌──────────────────────────────────────────────┐
│ Page Header (17 bytes — from Lesson 1) │
├──────────────────────────────────────────────┤
│ Slot Directory (grows →) │
│ [slot 0: offset, len] [slot 1: offset, len] │
├──────────────────────────────────────────────┤
│ │
│ Free Space │
│ │
├──────────────────────────────────────────────┤
│ Records (grow ←) │
│ [record 1 data] [record 0 data] │
└──────────────────────────────────────────────┘
Slot directory: An array of (offset, length) pairs, one per record. Slot 0 is the first entry. Each entry is 4 bytes (2 bytes offset + 2 bytes length), supporting records up to 65,535 bytes and offsets within a 64KB page. For 4KB pages, this is more than sufficient.
Records: Stored at the end of the page, growing backward (toward lower offsets). The first record inserted goes at the very end of the page; subsequent records are placed just before the previous one.
Free space: The gap between the end of the slot directory and the start of the record region. As records are inserted, the free space shrinks from both sides. The page is full when slot_directory_end >= record_region_start.
Record Addressing: (PageId, SlotId)
Higher layers of the storage engine refer to records by a Record ID (RID): a (page_id, slot_id) pair. This identifier is stable — it doesn't change when records are moved within the page during compaction, because the slot directory is updated to reflect the new offset. External references (B-tree leaf pointers, index entries) store RIDs, not raw byte offsets.
This indirection is what makes the slotted page powerful: the engine can rearrange the physical layout of records within a page (to reclaim fragmented space) without invalidating any external references. The slot ID stays the same; only the offset in the slot directory changes.
When a record is deleted, its slot entry is marked as tombstoned (offset set to a sentinel like 0xFFFF) but not removed from the directory. Removing it would shift all subsequent slot IDs by one, invalidating every external reference to those slots. Tombstoned slots can be reused for future inserts.
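The stable address can be captured in a small type. A sketch (the `RecordId` name is illustrative; the lessons refer to the pair `(page_id, slot_id)` directly):

```rust
use std::collections::HashMap;

/// A stable record address. Compaction can move the record's bytes
/// within its page, but the (page_id, slot_id) pair never changes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct RecordId {
    page_id: u32,
    slot_id: u16,
}

/// A toy secondary index: key (here, a NORAD ID) -> stable RecordId.
/// External structures store RecordIds, never raw byte offsets, so
/// in-page compaction never invalidates them.
fn build_index() -> HashMap<u32, RecordId> {
    let mut index = HashMap::new();
    index.insert(25544, RecordId { page_id: 7, slot_id: 3 });
    index
}
```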
Insertion
To insert a record of N bytes:
- Check if there is enough free space: free_space >= N + 4 (4 bytes for the new slot entry).
- Find the next available slot. If there's a tombstoned slot, reuse it. Otherwise, append a new entry to the directory.
- Write the record at record_region_start - N.
- Update the slot entry with (offset, N).
- Update the page header: increment record count, adjust free_space_offset.
If the free space check fails, the page is full. The caller must either split the page (in a B-tree) or allocate a new page (in a heap file).
Deletion and Fragmentation
Deleting a record tombstones its slot entry and marks the record's bytes as reclaimable. But it doesn't shift other records — doing so would change their offsets and require updating every other slot entry that points past the deleted record.
This creates internal fragmentation: there are N bytes of garbage between valid records. Over time, a page can have plenty of total free space but no contiguous block large enough for a new record.
Before delete:
[Header][Slot 0][Slot 1][Slot 2] [free] [Rec 2][Rec 1][Rec 0]
After deleting record 1:
[Header][Slot 0][TOMB ][Slot 2] [free] [Rec 2][DEAD ][Rec 0]
← gap here cannot be used
unless compacted →
Page Compaction
Compaction reclaims fragmented space by sliding all live records to the end of the page (closing the gaps left by deleted records) and updating their slot directory entries to reflect the new offsets. After compaction, all free space is contiguous.
The algorithm:
- Collect all live records (slot entries that are not tombstoned), sorted by their current offset in descending order.
- Starting from the end of the page, write each record contiguously.
- Update each slot entry with the new offset.
- Update the page header's free_space_offset.
Compaction is an in-page operation — it never spills to disk or affects other pages. It runs when an insert fails due to fragmentation (total free space is sufficient, but contiguous free space is not). Some engines compact proactively during quiet periods to avoid stalling inserts.
Overflow Pages
A single record might exceed the page's usable space (4,079 bytes in a 4KB page). This shouldn't happen for TLE records (140 bytes), but the engine must handle it for forward compatibility — covariance matrices and conjunction assessment reports can be kilobytes.
The solution: when a record is too large, store the first portion in the primary page and the remainder in one or more overflow pages. The slot entry points to the in-page portion, which contains a pointer to the overflow chain. This is sometimes called TOAST (The Oversized Attribute Storage Technique) in PostgreSQL terminology.
For the Orbital Object Registry, overflow pages are unlikely but should be supported. The implementation can be deferred until the schema actually requires it.
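One possible shape for this deferred design, purely illustrative (none of these type or field names come from the lesson; a real implementation would store the chain in page-sized frames):

```rust
/// In-page stub left behind when a record overflows.
struct OverflowStub {
    /// Total length of the full record across all pages.
    total_len: u32,
    /// Page ID of the first overflow page (the chain is singly linked).
    first_overflow_page: u32,
    /// The leading bytes that still fit in the primary page.
    inline_prefix: Vec<u8>,
}

/// Each overflow page holds a next-page pointer and a data payload.
struct OverflowPage {
    next_page: Option<u32>, // None terminates the chain
    payload: Vec<u8>,
}

/// Reassemble a record from its stub plus the overflow chain, given
/// a lookup function from page ID to overflow page.
fn reassemble(stub: &OverflowStub, fetch: impl Fn(u32) -> OverflowPage) -> Vec<u8> {
    let mut out = stub.inline_prefix.clone();
    let mut next = Some(stub.first_overflow_page);
    while let Some(pid) = next {
        let page = fetch(pid);
        out.extend_from_slice(&page.payload);
        next = page.next_page;
    }
    debug_assert_eq!(out.len(), stub.total_len as usize);
    out
}
```

The slot entry still points only at the in-page stub, so the slotted page machinery is unchanged; only the record encoding knows about the chain.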
Code Examples
A Slotted Page Implementation for TLE Records
This implements the core slotted page logic: insert, lookup, delete, and compaction.
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 17;
const SLOT_SIZE: usize = 4; // 2 bytes offset + 2 bytes length
const TOMBSTONE: u16 = 0xFFFF;
/// A slotted page that stores variable-length records.
struct SlottedPage {
data: [u8; PAGE_SIZE],
}
impl SlottedPage {
fn new(page_id: u32) -> Self {
let mut page = Self {
data: [0u8; PAGE_SIZE],
};
// Initialize header (simplified — reuse PageHeader from Lesson 1)
let mut header = PageHeader::new(page_id, PageType::Data);
header.serialize(&mut page.data);
// A fresh page has no records: the record region starts at the very
// end of the page, so free_space() and insert() work immediately.
page.set_record_region_start(PAGE_SIZE as u16);
page
}
/// Number of slots (including tombstoned ones).
fn slot_count(&self) -> u16 {
u16::from_le_bytes(self.data[9..11].try_into().unwrap())
}
fn set_slot_count(&mut self, count: u16) {
self.data[9..11].copy_from_slice(&count.to_le_bytes());
}
/// Byte offset where the record region begins (records grow downward).
fn record_region_start(&self) -> u16 {
u16::from_le_bytes(self.data[11..13].try_into().unwrap())
}
fn set_record_region_start(&mut self, offset: u16) {
self.data[11..13].copy_from_slice(&offset.to_le_bytes());
}
/// Read a slot directory entry.
fn read_slot(&self, slot_id: u16) -> (u16, u16) {
let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
let offset = u16::from_le_bytes(
self.data[base..base + 2].try_into().unwrap()
);
let length = u16::from_le_bytes(
self.data[base + 2..base + 4].try_into().unwrap()
);
(offset, length)
}
fn write_slot(&mut self, slot_id: u16, offset: u16, length: u16) {
let base = HEADER_SIZE + (slot_id as usize) * SLOT_SIZE;
self.data[base..base + 2].copy_from_slice(&offset.to_le_bytes());
self.data[base + 2..base + 4].copy_from_slice(&length.to_le_bytes());
}
/// Free space available for new records + slot entries.
fn free_space(&self) -> usize {
let slot_dir_end = HEADER_SIZE + (self.slot_count() as usize) * SLOT_SIZE;
let rec_start = self.record_region_start() as usize;
if rec_start > slot_dir_end {
rec_start - slot_dir_end
} else {
0
}
}
/// Insert a record. Returns the slot ID on success.
fn insert(&mut self, record: &[u8]) -> Option<u16> {
let needed = record.len() + SLOT_SIZE; // record bytes + new slot entry
if self.free_space() < needed {
return None; // Page full — caller should try compaction or new page
}
// Find a tombstoned slot to reuse, or allocate a new one
let slot_id = self.find_free_slot();
// Place the record at the end of the record region
let new_offset = self.record_region_start() - record.len() as u16;
let start = new_offset as usize;
let end = start + record.len();
self.data[start..end].copy_from_slice(record);
// Update the slot directory
self.write_slot(slot_id, new_offset, record.len() as u16);
self.set_record_region_start(new_offset);
Some(slot_id)
}
/// Look up a record by slot ID. Returns None if the slot is
/// tombstoned or out of range.
fn get(&self, slot_id: u16) -> Option<&[u8]> {
if slot_id >= self.slot_count() {
return None;
}
let (offset, length) = self.read_slot(slot_id);
if offset == TOMBSTONE {
return None; // Deleted record
}
let start = offset as usize;
let end = start + length as usize;
Some(&self.data[start..end])
}
/// Delete a record by tombstoning its slot entry.
fn delete(&mut self, slot_id: u16) -> bool {
if slot_id >= self.slot_count() {
return false;
}
let (offset, _) = self.read_slot(slot_id);
if offset == TOMBSTONE {
return false; // Already deleted
}
self.write_slot(slot_id, TOMBSTONE, 0);
true
}
/// Find a tombstoned slot to reuse, or allocate a new one.
fn find_free_slot(&mut self) -> u16 {
let count = self.slot_count();
for i in 0..count {
let (offset, _) = self.read_slot(i);
if offset == TOMBSTONE {
return i;
}
}
// No tombstoned slots — extend the directory
self.set_slot_count(count + 1);
count
}
}
Key design decisions: the slot directory grows forward from the header, records grow backward from the end of the page, and the two regions meet in the middle. This maximizes usable space — there's no fixed boundary between "slot space" and "record space." A page with few large records uses most of its space for record data; a page with many small records uses more for the slot directory.
The insert method does not attempt compaction automatically. The caller is responsible for detecting "free space exists but is fragmented" and calling compact() before retrying. This keeps the insert path simple and predictable.
Page Compaction: Defragmenting Live Records
When deletes have fragmented the record region, compaction slides all live records together and reclaims the gaps.
impl SlottedPage {
/// Compact the page: slide all live records to the end,
/// eliminating gaps from deleted records.
fn compact(&mut self) {
let slot_count = self.slot_count();
// Collect live records: (slot_id, data_copy)
let mut live_records: Vec<(u16, Vec<u8>)> = Vec::new();
for i in 0..slot_count {
let (offset, length) = self.read_slot(i);
if offset != TOMBSTONE {
let start = offset as usize;
let end = start + length as usize;
live_records.push((i, self.data[start..end].to_vec()));
}
}
// Rewrite records contiguously from the end of the page
let mut cursor = PAGE_SIZE as u16;
for (slot_id, record) in &live_records {
cursor -= record.len() as u16;
let start = cursor as usize;
let end = start + record.len();
self.data[start..end].copy_from_slice(record);
self.write_slot(*slot_id, cursor, record.len() as u16);
}
self.set_record_region_start(cursor);
}
}
This implementation copies live records into a temporary Vec and writes them back. A more memory-efficient approach would sort records by offset and slide them in-place, but the copy approach is correct, simple, and fast enough for 4KB pages. The total data moved is at most 4,079 bytes — negligible compared to the cost of a single disk I/O.
After compaction, the page's free space is contiguous. An insert that failed before compaction (due to fragmentation) will succeed after it — assuming the total free space is sufficient.
Key Takeaways
- Slotted pages decouple record identity (slot ID) from physical position (byte offset). Records can be moved within the page without invalidating external references.
- The (page_id, slot_id) Record ID is the stable address used by B-tree leaf nodes, index entries, and MVCC version chains. Every higher layer depends on this abstraction.
- Deletions create internal fragmentation. Compaction reclaims fragmented space by sliding live records together — an in-page operation that touches no other pages.
- Tombstoning (not removing) deleted slot entries preserves slot ID stability. A removed slot would shift all subsequent IDs, breaking every external reference.
- The "free space" calculation must account for both record bytes and slot directory growth. An insert that appears to fit by record size alone may fail because there's no room for the new slot entry.
Project — TLE Record Page Manager
Module: Database Internals — M01: Storage Engine Fundamentals
Track: Orbital Object Registry
Estimated effort: 4–6 hours
SDA Incident Report — OOR-2026-0042
Classification: ENGINEERING DIRECTIVE
Subject: Prototype page manager for the Orbital Object Registry
Ref: OOR-2026-0041 (TLE index latency deficiency)
The first deliverable in the OOR storage engine build is a page manager capable of reading and writing TLE records to a custom binary page format. This component sits at the bottom of the storage stack — every subsequent module builds on it. The page manager must demonstrate correct page layout, buffer pool caching, slotted page record management, and integrity verification via checksums.
- Objective
- TLE Record Format
- Acceptance Criteria
- Starter Structure
- Hints
- Reference Implementation
- What Comes Next
Objective
Build a PageManager that:
- Manages a database file composed of fixed-size 4KB pages
- Implements a buffer pool with LRU or CLOCK eviction
- Uses slotted pages for variable-length TLE record storage
- Verifies page integrity with CRC32 checksums on every read
- Supports insert, lookup by Record ID (page_id, slot_id), delete, and page compaction
TLE Record Format
For this project, a TLE record is a byte blob with the following structure:
/// A Two-Line Element record for a tracked orbital object.
struct TleRecord {
/// NORAD catalog number (unique object ID, e.g., 25544 for ISS)
norad_id: u32,
/// International designator (e.g., "98067A")
intl_designator: [u8; 8],
/// Epoch year + fractional day (e.g., 24045.5 = Feb 14 2024, 12:00 UTC)
epoch: f64,
/// Mean motion (revolutions per day)
mean_motion: f64,
/// Eccentricity (dimensionless, 0–1)
eccentricity: f64,
/// Inclination (degrees)
inclination: f64,
/// Right ascension of ascending node (degrees)
raan: f64,
/// Argument of perigee (degrees)
arg_perigee: f64,
/// Mean anomaly (degrees)
mean_anomaly: f64,
/// Drag term (B* coefficient)
bstar: f64,
/// Element set number (for provenance tracking)
element_set: u16,
/// Revolution number at epoch
rev_number: u32,
}
Serialized size: 4 + 8 + (8 × 8) + 2 + 4 = 82 bytes. Use little-endian encoding for all fields. You may add a 2-byte record-length prefix if your slotted page implementation requires it.
Acceptance Criteria
- Page I/O correctness. Pages are written to and read from a file at the correct offsets. A page written at page_id * 4096 is read back identically.
- Checksum verification. Every read_page call computes a CRC32 over the page body and compares it to the stored checksum. A tampered page (any bit flipped in the body) is detected and returns an error.
- Buffer pool hit rate. Insert 200 TLE records across multiple pages, then read them back in the same order. The buffer pool (configured with 8 frames) should achieve a hit rate above 90% on the read pass. Print the hit/miss counts.
- Slotted page insert and lookup. Insert 40 records into a single page. Look up each by its (page_id, slot_id) and verify the data matches.
- Delete and compaction. Delete every other record (slots 0, 2, 4, ...). Verify that lookups to deleted slots return None. Compact the page and verify that all remaining records are still accessible by their original slot IDs.
- Page full handling. Insert records until a page reports full. Verify that the failure is detected before corrupting any data. Allocate a new page and continue inserting.
- Deterministic output. The program runs without external dependencies beyond std and crc32fast. Output includes the buffer pool hit/miss stats and a summary of records inserted/read/deleted.
Starter Structure
tle-page-manager/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs the acceptance criteria
│ ├── page.rs # PageHeader, SlottedPage, checksums
│ ├── buffer_pool.rs # BufferPool, Frame, eviction policy
│ ├── page_file.rs # PageFile: raw I/O to the database file
│ └── tle.rs # TleRecord serialization/deserialization
Hints
Hint 1 — Serializing TLE records
Use to_le_bytes() for each field and concatenate them into a Vec<u8>. For deserialization, slice the byte buffer at the known offsets and use from_le_bytes(). Do not use serde or bincode — the point of this project is to understand raw binary layout.
impl TleRecord {
fn serialize(&self) -> Vec<u8> {
let mut buf = Vec::with_capacity(82);
buf.extend_from_slice(&self.norad_id.to_le_bytes());
buf.extend_from_slice(&self.intl_designator);
buf.extend_from_slice(&self.epoch.to_le_bytes());
// ... remaining fields
buf
}
}
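Deserialization mirrors the same layout in reverse: slice the buffer at the known offsets and decode with from_le_bytes. A sketch for the first three fields, assuming the field order shown above (deserialize_prefix is an illustrative helper, not part of the starter structure):

```rust
/// Decode the first three TLE fields from a little-endian buffer:
/// norad_id (u32 at 0..4), intl_designator ([u8; 8] at 4..12),
/// epoch (f64 at 12..20).
fn deserialize_prefix(buf: &[u8]) -> (u32, [u8; 8], f64) {
    let norad_id = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let intl_designator: [u8; 8] = buf[4..12].try_into().unwrap();
    let epoch = f64::from_le_bytes(buf[12..20].try_into().unwrap());
    (norad_id, intl_designator, epoch)
}
```

The try_into().unwrap() calls convert a slice of known length into a fixed-size array; they can only panic if the offsets are wrong, which makes a layout bug fail loudly in tests.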
Hint 2 — Buffer pool sizing
With 200 TLE records at 82 bytes each and ~49 records per page (4,079 usable bytes / 82 bytes ≈ 49, minus slot overhead), you need approximately 5 pages. An 8-frame buffer pool can hold the entire working set — but only if pages aren't evicted prematurely. Make sure your LRU implementation correctly promotes re-accessed pages.
Hint 3 — Compaction correctness check
After compacting, iterate all slot IDs and verify:
- Live records return the same data as before compaction
- Tombstoned slots still return None
- The page's total free space increased (fragmentation reclaimed)
- The page's contiguous free space equals total free space (no more gaps)
Hint 4 — Checksum verification testing
To test checksum detection, write a valid page to disk, then flip a single bit in the page body using raw file I/O. Read the page back through the buffer pool and verify that it returns a checksum error, not corrupted data.
// Flip bit 0 of byte 20 in page 1
let offset = 1 * PAGE_SIZE + 20;
file.seek(SeekFrom::Start(offset as u64))?;
let mut byte = [0u8; 1];
file.read_exact(&mut byte)?;
byte[0] ^= 0x01; // flip lowest bit
file.seek(SeekFrom::Start(offset as u64))?;
file.write_all(&byte)?;
Reference Implementation
Reveal full reference implementation
The reference implementation is intentionally omitted for this project. The three lessons provide all the code building blocks — your job is to integrate them into a working system. If you get stuck:
- Start with page_file.rs — get raw page I/O working first
- Add page.rs — implement PageHeader and SlottedPage from Lessons 1 and 3
- Add buffer_pool.rs — wrap the page file with caching from Lesson 2
- Add tle.rs — serialization is straightforward byte manipulation
- Wire them together in main.rs — run each acceptance criterion sequentially
What Comes Next
The page manager you build here is used directly by Module 2. B-tree nodes are stored as slotted pages in the buffer pool. The (page_id, slot_id) Record ID becomes the leaf-node pointer format in the B+ tree index.
Module 02 — B-Tree Index Structures
Track: Database Internals — Orbital Object Registry
Position: Module 2 of 6
Source material: Database Internals — Alex Petrov, Chapters 2, 4–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0043
Classification: PERFORMANCE DEFICIENCY
Subject: NORAD catalog ID lookups require full page scans
The page manager from Module 1 stores TLE records but provides no way to locate a specific record without scanning every page. A conjunction query for NORAD ID 25544 (ISS) currently reads all data pages sequentially — O(N) in the number of pages. With 100,000 tracked objects across ~2,000 data pages, a single point lookup takes 2–5ms. During a pass window with 500 conjunction checks per second, this saturates the I/O subsystem.
Directive: Build a B+ tree index over NORAD catalog IDs. Point lookups must be O(log N) in the number of records. Range scans over contiguous NORAD ID ranges must traverse only the relevant leaf pages.
The B-tree is the most widely deployed index structure in database systems. PostgreSQL, MySQL/InnoDB, SQLite, and most file systems use B-tree variants for ordered key lookups. This module covers the structure, invariants, and maintenance operations (splits and merges) that keep the tree balanced under insert and delete workloads.
Learning Outcomes
After completing this module, you will be able to:
- Describe the B-tree invariants (minimum fill factor, sorted keys, balanced height) and explain why they guarantee O(log N) lookups
- Implement node split and merge operations that maintain B-tree balance on insert and delete
- Distinguish between B-trees and B+ trees, and explain why B+ trees are preferred for range scans and disk-based storage
- Implement a B+ tree leaf-level linked list for efficient range scans over NORAD ID ranges
- Analyze the I/O cost of B-tree operations in terms of tree height and page size
- Integrate a B+ tree index with the page manager from Module 1
Lesson Summary
Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants
The B-tree data structure: internal nodes, leaf nodes, the fill factor invariant, and the guarantee of O(log N) height. How keys and child pointers are arranged within a node, and how the tree is traversed for point lookups.
Key question: What is the maximum height of a B-tree indexing 100,000 NORAD IDs with a branching factor of 200?
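The key question can be sanity-checked numerically. A sketch under the simplest model, where a tree of height h with branching factor b addresses b^h keys (real B-trees guarantee only a minimum fill factor, so actual height can be one level more):

```rust
/// Smallest height h such that branching_factor^h >= num_keys,
/// i.e. the number of levels below the root in the idealized
/// fully-packed model.
fn min_height(num_keys: u64, branching_factor: u64) -> u32 {
    let mut h = 0;
    let mut capacity: u64 = 1;
    while capacity < num_keys {
        capacity *= branching_factor; // one more level multiplies reach by b
        h += 1;
    }
    h
}
```

For 100,000 keys at branching factor 200, this gives height 3, since 200^2 = 40,000 < 100,000 <= 200^3 = 8,000,000 — three page reads per point lookup.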
Lesson 2 — Node Splits and Merges
Maintaining B-tree balance under writes. How inserts cause node splits (bottom-up), how deletes cause node merges or redistributions, and how these operations propagate up the tree. The difference between eager and lazy merge strategies.
Key question: Can a single insert into a B-tree with height H cause more than H page writes?
Lesson 3 — B+ Trees and Range Scans
The B+ tree variant: all data in leaf nodes, internal nodes hold only separator keys, and leaf nodes are linked for sequential access. Why this layout is optimal for disk-based storage engines that need both point lookups and range scans.
Key question: Why do B+ trees outperform B-trees for range scans even when both have the same height?
Capstone Project — B+ Tree TLE Index Engine
Build a B+ tree index over NORAD catalog IDs that supports point lookups, range scans, inserts, and deletes. The index is backed by the page manager from Module 1 — each B+ tree node is a slotted page in the buffer pool. Full project brief in project-btree-index.md.
File Index
module-02-btree-index-structures/
├── README.md ← this file
├── lesson-01-btree-structure.md ← B-tree structure and invariants
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-splits-merges.md ← Node splits and merges
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-bplus-trees.md ← B+ trees and range scans
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-btree-index.md ← Capstone project brief
Prerequisites
- Module 1 (Storage Engine Fundamentals) completed
- Understanding of slotted pages and the buffer pool
What Comes Next
Module 3 (LSM Trees & Compaction) introduces a fundamentally different index structure optimized for write-heavy workloads. The B+ tree you build here is read-optimized — every insert modifies pages in-place, which is expensive for high write throughput. The LSM tree takes the opposite approach: batch writes in memory and flush them to immutable files. Understanding both structures and their tradeoffs is essential for choosing the right approach for the OOR's workload.
Lesson 1 — B-Tree Structure: Keys, Pointers, and Invariants
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 2 and 4
Source note: This lesson was synthesized from training knowledge. The following concepts would benefit from verification against the source books: Petrov's specific notation for B-tree order vs. branching factor, and his framing of the fill factor invariant.
Context
A heap file of slotted pages provides O(1) access by Record ID (page_id, slot_id), but O(N) access by key — finding NORAD ID 25544 requires scanning every data page. For the Orbital Object Registry, this is the difference between a 0.05ms indexed lookup and a 5ms full scan. At 500 conjunction checks per second, indexed lookups consume 25ms of I/O per second. Full scans consume 2,500ms — the system spends more time scanning than computing.
The B-tree is a balanced, sorted, multi-way tree optimized for disk-based storage. Each node occupies one page, and the tree's branching factor (number of children per node) is determined by how many keys fit in a page. A B-tree with a branching factor of 200 and 100,000 records has a height of 3 — any record can be found in at most 3 page reads. Compare this to a binary search tree, which would have height ~17 and require 17 page reads.
B-trees were invented in 1970 by Bayer and McCreight specifically for disk-based access patterns. Every modern relational database and most file systems use B-tree variants as their primary index structure.
Core Concepts
Tree Structure
A B-tree of order m has the following properties:
- Every internal node has at most m children and at most m - 1 keys.
- Every internal node (except the root) has at least ⌈m/2⌉ children.
- The root has at least 2 children (unless it is a leaf).
- All leaves are at the same depth.
- Keys within each node are sorted in ascending order.
The keys in an internal node serve as separators — they direct the search to the correct child. For a node with keys [K₁, K₂, K₃] and children [C₀, C₁, C₂, C₃]:
- All keys in subtree C₀ are < K₁
- All keys in subtree C₁ are ≥ K₁ and < K₂
- All keys in subtree C₂ are ≥ K₂ and < K₃
- All keys in subtree C₃ are ≥ K₃
[30 | 60]
/ | \
[10|20] [40|50] [70|80|90]
/ | \ / | \ / | | \
... ... ... ... ... ... ... ... ... ...
Branching Factor and Tree Height
The branching factor determines how wide the tree is — and therefore how shallow. For the Orbital Object Registry:
- Page size: 4KB
- Key size: 4 bytes (NORAD ID as u32)
- Child pointer size: 4 bytes (page ID as u32)
- Node header overhead: ~20 bytes
Usable space per node: 4096 - 20 = 4076 bytes. Each key-pointer pair: 4 + 4 = 8 bytes. Maximum keys per node: 4076 / 8 ≈ 509. So the branching factor is approximately 510.
Tree height for N records with branching factor B: h = ⌈log_B(N)⌉ (levels from root to leaf, inclusive, assuming fully packed nodes).
| Records | B=510 Height | Page Reads per Lookup |
|---|---|---|
| 100,000 | 2 | 2 |
| 1,000,000 | 3 | 3 |
| 100,000,000 | 3 | 3 |
With branching factor 510, the entire 100,000-record OOR catalog is reachable in 2 page reads (root + leaf). The root node is almost always cached in the buffer pool, so in practice most lookups require only 1 disk read (the leaf node).
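The table's arithmetic is easy to sanity-check in code. Below is a back-of-the-envelope helper (not part of the index engine itself) that finds the smallest height h such that a fully packed tree with branching factor b can index n keys:

```rust
/// Smallest h such that b^h >= n: the number of page reads for a point
/// lookup in a fully packed tree with branching factor `b` over `n` keys.
fn estimated_height(n: u64, b: u64) -> u32 {
    assert!(b >= 2 && n >= 1);
    let mut h = 1;
    let mut reachable = b; // keys addressable by a tree of height h
    while reachable < n {
        reachable = reachable.saturating_mul(b);
        h += 1;
    }
    h
}
```

estimated_height(100_000, 510) returns 2 and estimated_height(100_000_000, 510) returns 3. Real trees run at ~67% fill, so treat these as lower bounds.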
Point Lookup Algorithm
To find key K:
- Start at the root node.
- Binary search the node's keys to find the correct child pointer.
- Follow the child pointer to the next level.
- Repeat until a leaf node is reached.
- Binary search the leaf node for K.
Each level requires one page read and one binary search. Binary search within a node is O(log m) comparisons — negligible compared to the page read cost.
Node Layout on Disk
Each B-tree node is stored as a page in the page file. The node layout within a page:
┌────────────────────────────────────────────┐
│ Node Header │
│ - node_type: u8 (Internal=0, Leaf=1) │
│ - key_count: u16 │
│ - parent_page_id: u32 (for split propagation) │
├────────────────────────────────────────────┤
│ Key-Pointer Pairs (for internal nodes): │
│ [child_0] [key_0] [child_1] [key_1] ... │
│ │
│ Key-Value Pairs (for leaf nodes): │
│ [key_0] [rid_0] [key_1] [rid_1] ... │
│ (RID = page_id + slot_id of data record) │
└────────────────────────────────────────────┘
Internal nodes store (child_page_id, key) pairs. Leaf nodes store (key, record_id) pairs where the record ID points to the actual TLE record in a data page (from Module 1).
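As an illustrative sketch of the leaf half of this layout, the serializer below writes the header fields and key-RID pairs into a 4KB page buffer. The exact field widths and the little-endian encoding are assumptions of this sketch, not a fixed wire format:

```rust
const PAGE_SIZE: usize = 4096;

/// Serialize a leaf node's entries into a page buffer following the
/// layout diagram: [node_type][key_count][parent_page_id][entries...].
/// RIDs are passed as (page_id, slot_id) pairs.
fn serialize_leaf(keys: &[u32], rids: &[(u32, u16)], page: &mut [u8]) {
    assert_eq!(page.len(), PAGE_SIZE);
    assert_eq!(keys.len(), rids.len());
    page[0] = 1; // node_type: Leaf = 1
    page[1..3].copy_from_slice(&(keys.len() as u16).to_le_bytes()); // key_count
    // Bytes 3..7 hold parent_page_id; left zeroed in this sketch.
    let mut off = 7;
    for (&key, &(page_id, slot_id)) in keys.iter().zip(rids) {
        page[off..off + 4].copy_from_slice(&key.to_le_bytes());
        page[off + 4..off + 8].copy_from_slice(&page_id.to_le_bytes());
        page[off + 8..off + 10].copy_from_slice(&slot_id.to_le_bytes());
        off += 10; // 4-byte key + 6-byte RID per entry
    }
}
```

With a 7-byte header and 10 bytes per entry, a 4KB page holds ⌊4089/10⌋ = 408 leaf entries under this toy layout; the ~20-byte header estimate above leaves room for the sibling pointers added in Lesson 3.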
The Fill Factor Invariant
The minimum fill requirement (at least ⌈m/2⌉ children per internal node) is what guarantees the tree stays balanced. Without it, degenerate deletions could produce a tree where one branch is much deeper than another, destroying the O(log N) guarantee.
The fill factor also ensures space efficiency — every node is at least half full, so the tree uses at most 2x the minimum space needed. In practice, B-trees maintain an average fill factor of ~67% (between the 50% minimum and 100% maximum), and bulk-loaded trees can achieve >90%.
Code Examples
B-Tree Node Representation
This defines the on-disk layout for B-tree nodes in the Orbital Object Registry, where keys are NORAD catalog IDs (u32) and values are Record IDs.
/// Record ID: a pointer to a TLE record in a data page.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RecordId {
page_id: u32,
slot_id: u16,
}
/// A B-tree node stored in a single page.
#[derive(Debug)]
enum BTreeNode {
Internal(InternalNode),
Leaf(LeafNode),
}
#[derive(Debug)]
struct InternalNode {
page_id: u32,
/// Separator keys. `keys[i]` is the boundary between children[i] and children[i+1].
keys: Vec<u32>,
/// Child page IDs. `children.len() == keys.len() + 1`.
children: Vec<u32>,
}
#[derive(Debug)]
struct LeafNode {
page_id: u32,
/// Sorted key-value pairs. Keys are NORAD IDs, values are Record IDs.
keys: Vec<u32>,
values: Vec<RecordId>,
}
impl InternalNode {
/// Find the child page that could contain the given key.
fn find_child(&self, key: u32) -> u32 {
// Binary search for the first separator key > search key.
// The child to the left of that separator is the correct subtree.
let pos = self.keys.partition_point(|&k| k <= key);
self.children[pos]
}
}
impl LeafNode {
/// Point lookup: find the Record ID for a given NORAD ID.
fn find(&self, key: u32) -> Option<RecordId> {
match self.keys.binary_search(&key) {
Ok(idx) => Some(self.values[idx]),
Err(_) => None,
}
}
}
The partition_point method is the correct choice for internal node search — it finds the insertion point, which corresponds to the child that owns the search key's range. Using binary_search would be wrong: duplicate separator keys (from splits) would match incorrectly, and binary_search returns an arbitrary match when duplicates exist.
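A minimal illustration of the same logic over a separator array containing a duplicate (the child_index helper is hypothetical, but mirrors find_child above):

```rust
/// Index of the child that owns `key`: the number of separators <= key.
/// Deterministic even when separator keys are duplicated.
fn child_index(separators: &[u32], key: u32) -> usize {
    separators.partition_point(|&k| k <= key)
}
```

For separators [10, 20, 20, 30], child_index(.., 20) is always 3, while binary_search(&20) may report index 1 or 2 depending on probe order.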
Traversal: Root-to-Leaf Lookup
A complete point lookup traverses from the root to a leaf, reading one page per level.
/// Look up a NORAD ID in the B-tree. Returns the Record ID if found.
fn btree_lookup(
root_page_id: u32,
key: u32,
buffer_pool: &mut BufferPool,
) -> io::Result<Option<RecordId>> {
let mut current_page_id = root_page_id;
loop {
let frame_idx = buffer_pool.fetch_page(current_page_id)?;
let page_data = buffer_pool.frame_data(frame_idx);
let node = deserialize_node(page_data)?;
// Unpin immediately — we've extracted the data we need.
// In a real implementation, we'd hold the pin during the
// search for concurrency safety (see Module 5).
buffer_pool.unpin(frame_idx, false);
match node {
BTreeNode::Internal(internal) => {
current_page_id = internal.find_child(key);
// Continue traversal — follow the child pointer
}
BTreeNode::Leaf(leaf) => {
return Ok(leaf.find(key));
}
}
}
}
This reads at most h pages where h is the tree height. For the OOR (100k records, branching factor ~510), h = 2. The root page is almost always cached in the buffer pool (it's accessed by every lookup), so the typical cost is 1 disk read — just the leaf page.
The comment about unpinning immediately is important: in a concurrent engine (Module 5), you'd hold the pin while searching to prevent the page from being evicted mid-traversal. For single-threaded Module 2, immediate unpin is safe and keeps the buffer pool available.
Key Takeaways
- A B-tree with branching factor B and N records has height O(log_B(N)). With B ≈ 500 (common for 4KB pages with small keys), a fully packed tree indexing 100 million records is only 3 levels deep.
- The fill factor invariant (nodes at least half full) guarantees balanced height and prevents degenerate trees. Splits and merges (Lesson 2) maintain this invariant.
- Internal nodes contain separator keys and child pointers. Leaf nodes contain the actual key-to-record-ID mapping. The search algorithm binary-searches within each node and follows pointers down the tree.
- The branching factor is determined by page size and key/pointer sizes. Larger pages or smaller keys mean a wider tree and fewer I/O operations per lookup.
- Root and upper-level internal nodes are almost always cached in the buffer pool, so the practical I/O cost of a lookup is usually just 1 page read (the leaf).
Lesson 2 — Node Splits and Merges
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapters 4–5
Source note: This lesson was synthesized from training knowledge. Verify Petrov's specific split/merge algorithm variants and his treatment of lazy vs. eager rebalancing against Chapters 4–5.
Context
A B-tree that only grows (inserts, never deletes) would eventually have every node at 100% capacity. The next insert into a full node would fail — unless the tree can restructure itself. Node splitting is the mechanism: when a node overflows, it divides into two half-full nodes and promotes a separator key to the parent. This maintains the B-tree invariant (all leaves at the same depth) and keeps every node between half and completely full.
The reverse operation — node merging — handles deletions. When a node drops below the minimum fill factor (half full), it either borrows keys from a sibling or merges with a sibling. Without merging, a delete-heavy workload could leave the tree full of nearly-empty nodes, wasting space and degrading scan performance.
For the OOR, inserts happen when new objects are cataloged or existing TLEs are updated (≈1,000/day for routine operations, burst to 10,000+ during fragmentation events). Deletes happen when objects re-enter the atmosphere or are reclassified. The split/merge machinery ensures the index stays balanced through both workload patterns.
Core Concepts
Leaf Node Split
When an insert arrives at a full leaf node:
- Allocate a new leaf page from the page manager.
- Move the upper half of the keys (and their record IDs) to the new node.
- The median key becomes the separator — it is promoted to the parent internal node.
- The parent inserts the separator key with a pointer to the new node.
Before split (leaf full, max 4 keys):
Parent: [...| 50 |...]
           |
Leaf:  [10, 20, 30, 40]   ← inserting 25
After split:
Parent: [...| 25 | 50 |...]
           |    |
Left: [10, 20]    Right: [25, 30, 40]
The choice of median matters: promoting the middle key keeps both new nodes as close to half-full as possible, maximizing the number of inserts before the next split. Some implementations promote the first key of the right node instead — simpler but slightly less balanced.
Internal Node Split
If the parent is also full when receiving the promoted separator, the parent itself must split. This propagation continues upward until either a non-full ancestor is found or the root splits. A root split is the only operation that increases the tree's height:
- Split the root into two children.
- Create a new root with one separator key pointing to the two children.
- Tree height increases by 1.
Root splits are rare — for a B-tree with branching factor 510, the root doesn't split until it already holds 509 keys, so a two-level tree can hold up to ~260,000 fully packed records before needing height 3.
The Cost of Splits
A single insert can trigger a cascade of splits from leaf to root. In the worst case (every ancestor is full), inserting one key causes h splits — one per level. Each split writes 2 pages (the original node and the new node) plus modifies the parent, so the worst-case write amplification is 2h + 1 page writes for one insert.
In practice, cascading splits are rare. The amortized cost of an insert is barely more than 1 page write: one for the leaf itself, plus a small amortized split cost (a split adds ~2 extra writes but happens only once per ~B/2 inserts into a given leaf).
Deletion and Underflow
When a key is deleted from a leaf:
- Remove the key and its record ID from the leaf.
- If the leaf is still at least half full, done.
- If the leaf is below half full (underflow), rebalance.
Rebalancing options, tried in order:
- Redistribute from a sibling: If an adjacent sibling has more than the minimum number of keys, transfer one key from the sibling through the parent. This keeps both nodes at valid fill levels.
- Merge with a sibling: If both the underflowing node and its sibling are at minimum, merge them into one node and remove the separator from the parent.
Merge reduces the parent's key count by one. If the parent then underflows, the same process propagates upward. A merge at the root level (when the root has only one child) reduces the tree height by 1.
Redistribution vs. Merge
Redistribution (sibling has spare keys):
Parent: [...| 25 |...]             Parent: [...| 30 |...]
         |        |           →          |        |
Left: [10]   Right: [25,30,40]    Left: [10,25]   Right: [30,40]
Merge (both at minimum):
Parent: [...| 30 |...] Parent: [...|...]
| | → |
Left: [10] Right: [30] Merged: [10,30]
Redistribution is preferred because it doesn't change the tree structure — no nodes are created or destroyed, no parent keys are removed. Merge is the fallback when redistribution isn't possible.
Lazy vs. Eager Rebalancing
Not all implementations rebalance immediately on underflow. Lazy rebalancing tolerates slightly-underfull nodes, deferring merges until a periodic compaction pass or until the node becomes completely empty. This reduces write amplification at the cost of slightly lower space efficiency and slightly higher scan costs (more nodes to traverse).
PostgreSQL's B-tree implementation, for example, does not merge on every delete — it marks deleted entries as "dead" and reclaims space during VACUUM. This is partly because merge operations require exclusive locks on multiple nodes, which would block concurrent readers.
For the OOR, lazy rebalancing is the pragmatic choice: the delete rate (~100/day for atmospheric re-entries) is low enough that occasional underfull nodes have negligible impact on scan performance.
Code Examples
Leaf Node Split
Splitting a full leaf node during insert, promoting the median key to the parent.
impl LeafNode {
/// Split this leaf and return (median_key, new_right_leaf).
/// After split, `self` retains the lower half of the keys.
fn split(&mut self, new_page_id: u32) -> (u32, LeafNode) {
let mid = self.keys.len() / 2;
// The median key is promoted to the parent as a separator
let median_key = self.keys[mid];
// Right half moves to the new node
let right_keys = self.keys.split_off(mid);
let right_values = self.values.split_off(mid);
let right = LeafNode {
page_id: new_page_id,
keys: right_keys,
values: right_values,
// In a B+ tree, link the leaves (see Lesson 3)
next_leaf: self.next_leaf,
};
// Update the left node's forward pointer to the new right sibling
self.next_leaf = Some(new_page_id);
(median_key, right)
}
}
split_off(mid) is the right tool here — it moves the elements from index mid to the end into a new Vec (a single bulk copy plus a truncation of the original). The left node retains elements [0..mid) and the right node gets [mid..]. The median key is promoted to the parent but also remains in the right leaf — in a B+ tree, leaf nodes hold all keys, and internal nodes hold copies as separators.
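The deletion-side operations from the Core Concepts section, redistribution and merge, can be sketched the same way. The Lesson 1 types are restated so the example is self-contained; parent-side bookkeeping is noted in comments:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct RecordId { page_id: u32, slot_id: u16 }

#[derive(Debug)]
struct LeafNode { page_id: u32, keys: Vec<u32>, values: Vec<RecordId> }

impl LeafNode {
    /// Borrow the right sibling's smallest entry. Returns the new
    /// separator the parent must store between the two nodes.
    fn redistribute_from_right(&mut self, right: &mut LeafNode) -> u32 {
        self.keys.push(right.keys.remove(0));
        self.values.push(right.values.remove(0));
        right.keys[0] // keys >= this still route to the right sibling
    }

    /// Absorb the right sibling. The parent must then remove the
    /// separator and child pointer that led to `right`; in a B+ tree
    /// this node also takes over right's next_leaf pointer.
    fn merge_with_right(&mut self, right: LeafNode) {
        self.keys.extend(right.keys);
        self.values.extend(right.values);
    }
}
```

Calling redistribute_from_right on Left [10] and Right [25, 30, 40] moves 25 left and returns 30 as the new parent separator, leaving both nodes at valid fill levels.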
Inserting into a B-Tree with Split Propagation
A top-level insert that handles splits propagating up the tree.
/// Insert a key-value pair into the B+ tree.
/// If the root splits, creates a new root and increases tree height.
fn btree_insert(
root_page_id: &mut u32,
key: u32,
rid: RecordId,
buffer_pool: &mut BufferPool,
) -> io::Result<()> {
// Traverse to the leaf, collecting the path of ancestors
let (leaf_page_id, ancestors) = find_leaf_with_path(
*root_page_id, key, buffer_pool
)?;
// Attempt to insert into the leaf
let overflow = insert_into_leaf(leaf_page_id, key, rid, buffer_pool)?;
if let Some((promoted_key, new_page_id)) = overflow {
// Leaf split occurred — propagate up
propagate_split(
root_page_id, promoted_key, new_page_id,
&ancestors, buffer_pool
)?;
}
Ok(())
}
/// Propagate a split upward through the ancestors.
fn propagate_split(
root_page_id: &mut u32,
mut promoted_key: u32,
mut new_child_page_id: u32,
ancestors: &[u32], // page IDs from root to parent-of-leaf
buffer_pool: &mut BufferPool,
) -> io::Result<()> {
// Walk ancestors from bottom (parent of leaf) to top (root)
for &ancestor_page_id in ancestors.iter().rev() {
let overflow = insert_into_internal(
ancestor_page_id, promoted_key, new_child_page_id,
buffer_pool,
)?;
match overflow {
None => return Ok(()), // Ancestor had room — done
Some((key, page_id)) => {
promoted_key = key;
new_child_page_id = page_id;
// Continue propagating
}
}
}
// If we reach here, the root itself split.
// Create a new root pointing to the old root and the new child.
let new_root_page = buffer_pool.allocate_page()?;
let new_root = InternalNode {
page_id: new_root_page,
keys: vec![promoted_key],
children: vec![*root_page_id, new_child_page_id],
};
serialize_and_write_node(&new_root, buffer_pool)?;
*root_page_id = new_root_page;
Ok(())
}
The ancestor path is collected during the initial traversal. This avoids re-traversing the tree during split propagation, which would be both slower and incorrect under concurrent modifications (a problem we'll address in Module 5).
The root_page_id is passed as &mut u32 because a root split changes it — the old root becomes a child of the new root. In a production engine, the root page ID is stored in the file header page and updated atomically with WAL protection.
Key Takeaways
- Node splits maintain the B-tree invariant by dividing overfull nodes and promoting a separator key to the parent. Splits propagate upward; root splits are the only operation that increases tree height.
- The amortized cost of an insert is just over 1 page write. The worst case (full cascade) is 2h+1 writes but occurs rarely — once per ~B/2 inserts per level.
- Deletions trigger rebalancing when a node drops below half full. Redistribution (borrowing from a sibling) is preferred; merging is the fallback. Merges propagate upward like splits.
- Lazy rebalancing (deferring merges) reduces write amplification and lock contention at the cost of slightly underfull nodes. Most production B-tree implementations use some form of lazy deletion.
- The ancestor path must be collected during traversal for split propagation. Re-traversing the tree after a split is both slower and unsafe under concurrent access.
Lesson 3 — B+ Trees and Range Scans
Module: Database Internals — M02: B-Tree Index Structures
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 5–6; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3
Source note: This lesson was synthesized from training knowledge. Verify Petrov's treatment of B+ tree leaf linking, his comparison of B-tree vs. B+ tree I/O cost for range scans, and his coverage of prefix compression in Chapter 6.
Context
The B-tree from Lessons 1 and 2 stores key-value pairs in both internal and leaf nodes. This is correct and complete for point lookups, but it has a significant limitation for range scans: the data is distributed across all levels of the tree. Scanning NORAD IDs 40000–40500 requires traversing internal nodes to find the start, then potentially bouncing between internal and leaf nodes to collect all matching records.
The B+ tree variant solves this by separating concerns: internal nodes contain only separator keys and child pointers (they are a pure navigation structure), and all data records live in leaf nodes. Leaf nodes are linked into a doubly-linked list, so a range scan only needs one tree traversal (to find the starting leaf) followed by a sequential walk along the leaf chain.
This is why virtually every production relational database uses B+ trees, not plain B-trees, as their primary index structure. The leaf-level linked list turns range scans from O(K log N) (re-traversing from the root for each of the K matching keys) into O(log N + K) — a massive improvement for the OOR's conjunction query workload, which frequently scans orbital parameter ranges.
Core Concepts
B+ Tree vs. B-Tree
| Property | B-Tree | B+ Tree |
|---|---|---|
| Data location | All nodes (internal + leaf) | Leaf nodes only |
| Internal node contents | Keys + values + child pointers | Keys + child pointers only |
| Range scan support | Must re-traverse or backtrack | Sequential leaf walk |
| Branching factor | Lower (values consume space in internal nodes) | Higher (internal nodes hold more keys) |
| Point lookup I/O | Potentially fewer reads (data can be in internal nodes) | Always traverses to leaf |
| Disk space for keys | Each key stored once | Separator keys duplicated in internal nodes |
For the OOR, the B+ tree's higher branching factor (more keys per internal node) and efficient range scans outweigh the slight overhead of always traversing to leaf nodes. The conjunction query workload is dominated by range scans over orbital parameter ranges, not single-key lookups.
Leaf-Level Linked List
B+ tree leaf nodes are connected in a linked list sorted by key order. Each leaf stores a pointer to its right sibling (and optionally its left sibling for reverse scans):
Root: [30 | 60]
/ | \
/ | \
[10,20] → [30,40,50] → [60,70,80,90]
(leaf 1) (leaf 2) (leaf 3)
A range scan for keys 35–55:
- Tree traversal: root → leaf 2 (35 falls between separators 30 and 60).
- Sequential scan: read leaf 2 (keys 40 and 50 match; 30 is skipped), follow the next pointer to leaf 3 (key 60 > 55, stop).
- Total I/O: 2 page reads (root + leaf 2) + 1 page read (leaf 3) = 3 pages. The root is cached, so practical I/O is 2 page reads.
Without the linked list, the engine would need to return to the root for each key, or implement a complex in-order traversal using a stack of parent pointers.
Separator Keys in Internal Nodes
In a B+ tree, internal node keys are separators — they do not need to be exact copies of leaf keys. They only need to correctly direct traffic. If the left child's maximum key is 29 and the right child's minimum key is 30, any value in the range [30, ...] works as a separator. Some implementations use abbreviated separators (the shortest key that correctly divides the two children) to fit more entries per internal node.
This means a delete from a leaf does not necessarily require updating the parent's separator. If you delete key 30 from a leaf, the separator in the parent can remain 30 — it still correctly directs traffic because all keys in the right child are ≥ 30 (the next key might be 31). The separator only needs updating if the split boundary itself changes.
Prefix and Suffix Compression
Leaf nodes in a B+ tree often contain keys with significant commonality — for example, NORAD IDs in the range 40000–40499 share the prefix "40". Prefix compression stores the common prefix once and encodes each key as a delta from the prefix. Suffix truncation removes key suffixes that are unnecessary for correct comparison within the node.
For the OOR's u32 NORAD IDs, these optimizations provide modest savings (4-byte keys have limited prefix sharing). They become critical for composite keys or string keys — a B+ tree indexing international designators like "2024-001A" through "2024-999Z" would benefit substantially.
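Suffix truncation for byte-string keys like international designators can be sketched as a search for the shortest prefix of the right child's minimum key that still sorts after the left child's maximum. This is a simplified illustration, not Petrov's exact algorithm:

```rust
/// Shortest separator s with left_max < s <= right_min, built by taking
/// the shortest prefix of right_min that still sorts after left_max.
fn shortest_separator(left_max: &[u8], right_min: &[u8]) -> Vec<u8> {
    debug_assert!(left_max < right_min);
    for len in 1..=right_min.len() {
        let candidate = &right_min[..len];
        // A prefix of right_min always sorts <= right_min, so the first
        // prefix that beats left_max is a valid (and shortest) separator.
        if candidate > left_max {
            return candidate.to_vec();
        }
    }
    right_min.to_vec()
}
```

For example, shortest_separator(b"2024-001A", b"2024-500B") yields b"2024-5": six bytes in the internal node instead of nine, and every key still routes to the correct child.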
Bulk Loading
Building a B+ tree by inserting records one at a time produces nodes at ~50-67% average fill. Bulk loading builds the tree bottom-up from sorted data:
- Sort all records by key.
- Pack records into leaf nodes at near-100% fill.
- Build internal nodes bottom-up by taking separator keys from each pair of adjacent leaves.
- Repeat upward until the root is created.
Bulk loading is O(N) in the number of records and produces an optimally-packed tree. For the OOR's initial catalog load (100,000 records), bulk loading fills ~200 leaf pages at 100% fill. One-by-one inserts would produce ~300-400 pages at 50-67% fill.
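The leaf-packing phase of bulk loading (sorting, packing, and linking the leaves) can be sketched as below. Sequential page-ID assignment is a simplification; a real build would request page IDs from the page manager. The types are restated for self-containment:

```rust
#[derive(Debug, Clone, Copy)]
struct RecordId { page_id: u32, slot_id: u16 }

#[derive(Debug)]
struct BPlusLeafNode {
    page_id: u32,
    keys: Vec<u32>,
    values: Vec<RecordId>,
    next_leaf: Option<u32>,
    prev_leaf: Option<u32>,
}

/// Pack sorted (key, rid) pairs into fully filled leaves and link the
/// leaf chain. Page IDs are assigned sequentially from `first_page_id`.
fn pack_leaves(sorted: &[(u32, RecordId)], cap: usize, first_page_id: u32) -> Vec<BPlusLeafNode> {
    let mut leaves: Vec<BPlusLeafNode> = sorted
        .chunks(cap)
        .enumerate()
        .map(|(i, chunk)| BPlusLeafNode {
            page_id: first_page_id + i as u32,
            keys: chunk.iter().map(|&(k, _)| k).collect(),
            values: chunk.iter().map(|&(_, v)| v).collect(),
            next_leaf: None,
            prev_leaf: None,
        })
        .collect();
    // Wire up the doubly-linked leaf chain.
    let n = leaves.len();
    for i in 0..n {
        if i + 1 < n {
            let next = leaves[i + 1].page_id;
            leaves[i].next_leaf = Some(next);
        }
        if i > 0 {
            let prev = leaves[i - 1].page_id;
            leaves[i].prev_leaf = Some(prev);
        }
    }
    leaves
}
```

Internal levels would then be built bottom-up from each leaf's first key, one separator entry per leaf.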
Code Examples
B+ Tree Leaf Node with Sibling Links
Extending the leaf node from Lesson 1 with linked-list pointers for range scan support.
/// B+ tree leaf node with sibling links for range scans.
#[derive(Debug, Clone)]
struct BPlusLeafNode {
    page_id: u32,
    keys: Vec<u32>,
    values: Vec<RecordId>,
    /// Forward pointer to the next leaf (right sibling)
    next_leaf: Option<u32>,
    /// Backward pointer for reverse scans (optional)
    prev_leaf: Option<u32>,
}
impl BPlusLeafNode {
    /// Range scan: iterate all keys in [start, end] starting from this leaf.
    /// Returns a lazy iterator that follows the leaf chain.
    fn range_scan_from<'a>(
        &self,
        start: u32,
        end: u32,
        buffer_pool: &'a mut BufferPool,
    ) -> BPlusRangeScanIterator<'a> {
        // Find the first key >= start in this leaf
        let start_idx = self.keys.partition_point(|&k| k < start);
        // Seed the iterator with a copy of this leaf; it lazily follows
        // the chain from here. Fields match BPlusRangeScanIterator.
        BPlusRangeScanIterator {
            current_leaf: Some(self.clone()),
            idx: start_idx,
            end_key: end,
            buffer_pool,
            done: false,
        }
    }
}
The range scan iterator is lazy — it reads the next leaf page only when the current leaf's keys are exhausted. This avoids reading leaf pages beyond what the caller actually consumes (important if the caller stops early after finding a match).
Range Scan Iterator
The iterator follows the leaf chain, reading one page at a time.
/// Iterator over a range of B+ tree leaf entries.
/// Follows the leaf-level linked list until the end key is exceeded.
struct BPlusRangeScanIterator<'a> {
current_leaf: Option<BPlusLeafNode>,
idx: usize,
end_key: u32,
buffer_pool: &'a mut BufferPool,
done: bool,
}
impl<'a> BPlusRangeScanIterator<'a> {
fn next_entry(&mut self) -> Option<io::Result<(u32, RecordId)>> {
loop {
if self.done {
return None;
}
let leaf = self.current_leaf.as_ref()?;
// Check if there are more entries in the current leaf
if self.idx < leaf.keys.len() {
let key = leaf.keys[self.idx];
if key > self.end_key {
self.done = true;
return None;
}
let rid = leaf.values[self.idx];
self.idx += 1;
return Some(Ok((key, rid)));
}
// Current leaf exhausted — follow the chain
match leaf.next_leaf {
None => {
self.done = true;
return None;
}
Some(next_page_id) => {
match self.load_leaf(next_page_id) {
Ok(next_leaf) => {
self.current_leaf = Some(next_leaf);
self.idx = 0;
// Loop back to yield from the new leaf
}
Err(e) => {
self.done = true;
return Some(Err(e));
}
}
}
}
}
}
fn load_leaf(&mut self, page_id: u32) -> io::Result<BPlusLeafNode> {
let frame_idx = self.buffer_pool.fetch_page(page_id)?;
let data = self.buffer_pool.frame_data(frame_idx);
let leaf = deserialize_leaf_node(data)?;
self.buffer_pool.unpin(frame_idx, false);
Ok(leaf)
}
}
This pattern — a struct that holds iteration state and lazily loads pages — is the volcano iterator model that we'll formalize in Module 6 (Query Processing). Every B+ tree range scan in every database engine works this way: traverse to the starting leaf, then pull one record at a time from the leaf chain, fetching the next page only when the current one is exhausted.
Key Takeaways
- B+ trees store all data in leaf nodes and use internal nodes purely for navigation. This maximizes the internal node branching factor and enables efficient range scans via the leaf-level linked list.
- Range scans are O(log N + K): one tree traversal to the starting leaf, then K sequential leaf reads. This is the primary advantage over plain B-trees and hash indices.
- Separator keys in internal nodes don't need to be exact copies of leaf keys — any value that correctly directs traffic works. This enables prefix compression and abbreviated separators.
- Bulk loading produces a B+ tree at near-100% fill factor in O(N) time, compared to O(N log N) for one-by-one inserts at ~50-67% fill. Always bulk-load when building an index from scratch.
- The leaf-level scan iterator is the first instance of the volcano iterator pattern — a pull-based interface that lazily fetches pages on demand. This pattern recurs throughout the query processing stack.
Project — B+ Tree TLE Index Engine
Module: Database Internals — M02: B-Tree Index Structures
Track: Orbital Object Registry
Estimated effort: 6–8 hours
- SDA Incident Report — OOR-2026-0043
- Objective
- Acceptance Criteria
- Starter Structure
- Hints
- What Comes Next
SDA Incident Report — OOR-2026-0043
Classification: ENGINEERING DIRECTIVE
Subject: Build ordered index for NORAD catalog ID lookups and range scans
The page manager from Module 1 stores TLE records but requires full scans for key-based access. Build a B+ tree index over NORAD catalog IDs that provides O(log N) point lookups and efficient range scans via a leaf-level linked list. The index must integrate with the existing page manager and buffer pool.
Objective
Build a B+ tree index that:
- Uses the page manager and buffer pool from Module 1 — each B+ tree node is stored as a page
- Supports point lookups by NORAD catalog ID in O(log N) page reads
- Supports range scans over NORAD ID ranges using the leaf-level linked list
- Handles inserts with automatic node splitting and split propagation
- Handles deletes with tombstoning (lazy rebalancing is acceptable)
- Provides a bulk-load operation for initial catalog construction
Acceptance Criteria
- Point lookup correctness. Insert 10,000 TLE records with random NORAD IDs. Look up each by ID and verify the returned Record ID matches.
- Range scan correctness. Insert NORAD IDs 1–10,000. Scan range [5000, 5100] and verify exactly 101 records returned in sorted order.
- Split handling. Insert records until at least 3 leaf splits occur. Verify the tree remains balanced (all leaves at same depth) and all records retrievable.
- Bulk-load efficiency. Bulk-load 100,000 sorted records. Verify leaf fill factor above 95%. Compare leaf count to one-by-one insertion.
- Delete correctness. Delete 1,000 records by NORAD ID. Verify lookups return None for deleted keys and remaining records are unaffected.
- Integration with buffer pool. Run full test suite with only 16 buffer pool frames. Verify correctness under eviction pressure.
- Deterministic output. Print tree height, leaf count, fill factor, and buffer pool hit/miss stats after each test phase.
Starter Structure
btree-index/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── btree.rs # BPlusTree: insert, lookup, range_scan, delete, bulk_load
│ ├── node.rs # InternalNode, LeafNode: serialization, split, merge
│ ├── page.rs # Reuse from Module 1
│ ├── buffer_pool.rs # Reuse from Module 1
│ ├── page_file.rs # Reuse from Module 1
│ └── tle.rs # Reuse from Module 1
Hints
Hint 1 — Node serialization format
Use a 1-byte discriminant at the start of each node page to distinguish internal from leaf. Internal: [type=0][key_count][child_0][key_0][child_1].... Leaf: [type=1][key_count][next_leaf][prev_leaf][key_0][rid_0]....
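A minimal sketch of reading that discriminant and the key count, assuming the little-endian layout above (the field widths here — u16 key count — are illustrative, not prescribed by the hint):

```rust
// Hypothetical parse of the node header described in Hint 1.
// Byte 0: discriminant (0 = internal, 1 = leaf). Bytes 1-2: key_count (u16 LE).
enum NodeKind {
    Internal,
    Leaf,
}

fn node_kind(page: &[u8]) -> Option<NodeKind> {
    match *page.first()? {
        0 => Some(NodeKind::Internal),
        1 => Some(NodeKind::Leaf),
        _ => None, // corrupt page
    }
}

fn key_count(page: &[u8]) -> Option<u16> {
    let bytes: [u8; 2] = page.get(1..3)?.try_into().ok()?;
    Some(u16::from_le_bytes(bytes))
}
```

Returning `None` for an unknown discriminant lets the caller surface page corruption instead of silently misinterpreting bytes.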
Hint 2 — Ancestor path for split propagation
During root-to-leaf traversal, push each internal node's page ID onto a Vec<u32>. After a leaf split, pop ancestors one at a time to insert the promoted key. If the ancestor splits too, continue popping. Empty stack = create new root.
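The pop-until-done loop can be sketched in memory. This is a hypothetical model — nodes live in a `HashMap` keyed by page ID and capacity is tiny (2 keys) so splits cascade quickly — not the on-disk format from Hint 1:

```rust
use std::collections::HashMap;

const MAX_KEYS: usize = 2; // tiny demo capacity so splits cascade

struct Internal { keys: Vec<u32>, children: Vec<u32> }
struct Tree { nodes: HashMap<u32, Internal>, root: u32, next_id: u32 }
struct Split { promoted_key: u32, right_page: u32 }

impl Tree {
    // Insert (promoted_key, right_page) into `page`; return Some(Split)
    // if this internal node overflows and splits in turn.
    fn insert_into_internal(&mut self, page: u32, s: Split) -> Option<Split> {
        let node = self.nodes.get_mut(&page).unwrap();
        let pos = node.keys.partition_point(|&k| k < s.promoted_key);
        node.keys.insert(pos, s.promoted_key);
        node.children.insert(pos + 1, s.right_page);
        if node.keys.len() <= MAX_KEYS {
            return None;
        }
        // Overflow: middle key is promoted, right half moves to a new node.
        let mid = node.keys.len() / 2;
        let promoted = node.keys[mid];
        let right_keys = node.keys.split_off(mid + 1);
        let right_children = node.children.split_off(mid + 1);
        node.keys.pop(); // the promoted key leaves the left node
        let right_id = self.next_id;
        self.next_id += 1;
        self.nodes.insert(right_id, Internal { keys: right_keys, children: right_children });
        Some(Split { promoted_key: promoted, right_page: right_id })
    }

    // Pop ancestors one at a time; an empty stack means the root split.
    fn propagate(&mut self, mut ancestors: Vec<u32>, leaf_split: Split) {
        let mut pending = Some(leaf_split);
        while let Some(split) = pending.take() {
            match ancestors.pop() {
                Some(parent) => pending = self.insert_into_internal(parent, split),
                None => {
                    // New root: one separator key, two children.
                    let id = self.next_id;
                    self.next_id += 1;
                    self.nodes.insert(id, Internal {
                        keys: vec![split.promoted_key],
                        children: vec![self.root, split.right_page],
                    });
                    self.root = id;
                }
            }
        }
    }
}
```

The `while let … take()` shape mirrors the hint exactly: each iteration consumes one pending split, and a parent that splits produces the next one.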
Hint 3 — Bulk-load algorithm
1. Sort all records by NORAD ID.
2. Pack keys into leaf nodes at capacity, write each, record its page ID and last key.
3. Build internal nodes bottom-up: group separators into internal node pages.
4. Repeat step 3 until one root remains.
5. Link leaves into a doubly-linked list.
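The packing arithmetic can be sketched without any page I/O. This hypothetical helper only tracks each level's separators (the last key per node) and returns the resulting tree height and leaf count; the demo capacities are tiny, not page-sized:

```rust
// Demo capacities — real nodes are sized to fill a page.
const LEAF_CAP: usize = 4;     // max records per leaf
const INTERNAL_CAP: usize = 4; // max children per internal node

/// Returns (tree height including the leaf level, leaf count)
/// for a bulk load of `sorted_keys` at full fill.
fn bulk_load(sorted_keys: &[u32]) -> (usize, usize) {
    // Step 2: pack leaves at capacity; each leaf's last key is its separator.
    let mut separators: Vec<u32> = sorted_keys
        .chunks(LEAF_CAP)
        .map(|leaf| *leaf.last().unwrap())
        .collect();
    let leaf_count = separators.len();
    // Steps 3-4: group separators into internal nodes until one root remains.
    let mut height = 1;
    while separators.len() > 1 {
        separators = separators
            .chunks(INTERNAL_CAP)
            .map(|group| *group.last().unwrap())
            .collect();
        height += 1;
    }
    (height, leaf_count)
}
```

Because every node is packed full, this gives the minimum possible leaf count — the baseline the acceptance criteria compare one-by-one insertion against.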
Hint 4 — Buffer pool pressure during splits
Split propagation can pin the split node, the new node, and the parent simultaneously (3 frames). With a 16-frame pool and a 2-level tree, this is safe. But unpin nodes as soon as you've serialized them back — don't hold all three longer than necessary.
What Comes Next
Module 3 introduces LSM trees — a fundamentally different approach. Where B+ trees update pages in-place, LSM trees batch writes in memory and flush immutable files. You'll understand when each is appropriate for the OOR workload.
Module 03 — LSM Trees & Compaction
Track: Database Internals — Orbital Object Registry
Position: Module 3 of 6
Source material: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Course — Alex Chi Z
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0044
Classification: PERFORMANCE DEFICIENCY
Subject: B+ tree write throughput insufficient for fragmentation event ingestion

During the Cosmos-2251 debris cascade simulation, the OOR must ingest 12,000 new TLE records in under 60 seconds. The B+ tree index from Module 2 achieves ~200 inserts/second — each insert requires a root-to-leaf traversal and potential node split, generating 2–4 random page writes per insert. At this rate, ingesting 12,000 records takes the full 60 seconds, consuming the entire conjunction window with no margin.
Directive: Evaluate and implement an LSM-tree-based storage architecture. LSM trees batch writes in memory and flush them as immutable sorted files, converting random writes to sequential writes. The tradeoff: reads become more expensive (must check multiple files), but write throughput increases by 10–100x.
The LSM tree (Log-Structured Merge Tree) is the dominant architecture for write-heavy storage engines. RocksDB, LevelDB, Cassandra, HBase, CockroachDB, and TiKV all use LSM-tree variants. Where the B+ tree is optimized for read-heavy workloads with moderate writes, the LSM tree is optimized for write-heavy workloads where reads can tolerate checking multiple sorted structures.
This module covers the full LSM architecture: memtables, sorted string tables (SSTs), the write and read paths, compaction strategies, and read optimizations (bloom filters, block cache). It draws on the mini-lsm course structure and the LSM coverage in Database Internals Chapter 7.
Learning Outcomes
After completing this module, you will be able to:
- Describe the LSM write path (memtable → immutable memtable → SST flush) and explain why it converts random writes to sequential I/O
- Implement a memtable backed by a sorted data structure (e.g., BTreeMap) and flush it to an immutable SST file
- Design an SST file format with data blocks, index blocks, and metadata blocks
- Explain the three amplification factors (read, write, space) and how compaction strategies trade between them
- Compare leveled, tiered, and FIFO compaction strategies and choose the appropriate strategy for a given workload
- Implement bloom filters and a block cache to reduce read amplification in an LSM engine
Lesson Summary
Lesson 1 — Memtables and Sorted String Tables
The LSM write path: how writes are batched in an in-memory memtable, frozen into immutable memtables, and flushed to sorted string table (SST) files on disk. The SST file format: data blocks, index blocks, bloom filter blocks, and footer. How the read path probes the memtable first, then SSTs from newest to oldest.
Key question: Why does the LSM tree maintain both a mutable and one or more immutable memtables instead of writing directly from the mutable memtable to disk?
Lesson 2 — Compaction Strategies
The core problem: as SSTs accumulate, reads slow down (more files to check) and space grows (deleted keys aren't reclaimed until compacted). Compaction merges SSTs to reduce read amplification and reclaim space, at the cost of write amplification. Leveled compaction, tiered (universal) compaction, and FIFO compaction — each trades differently between read, write, and space amplification.
Key question: Can you design a compaction strategy that minimizes all three amplification factors simultaneously?
Lesson 3 — Bloom Filters, Block Cache, and Read Optimization
LSM reads are expensive — they must check the memtable plus potentially every SST level. Bloom filters let the engine skip SSTs that definitely don't contain the target key. The block cache keeps hot SST data blocks in memory. Together, they reduce the effective read amplification from O(levels) to near O(1) for point lookups.
Key question: A bloom filter with a 1% false positive rate eliminates 99% of unnecessary SST reads. What is the cost of tightening it to a 0.1% false positive rate?
Capstone Project — LSM-Backed TLE Storage Engine
Build an LSM storage engine for the Orbital Object Registry that supports put, get, delete, and scan operations. The engine must implement memtable→SST flush, a basic leveled compaction strategy, and bloom filters for point lookup optimization. Full project brief in project-lsm-engine.md.
File Index
module-03-lsm-trees-compaction/
├── README.md ← this file
├── lesson-01-memtables-ssts.md ← Memtables and sorted string tables
├── lesson-01-quiz.toml ← Quiz (5 questions)
├── lesson-02-compaction.md ← Compaction strategies
├── lesson-02-quiz.toml ← Quiz (5 questions)
├── lesson-03-read-optimization.md ← Bloom filters, block cache, read path
├── lesson-03-quiz.toml ← Quiz (5 questions)
└── project-lsm-engine.md ← Capstone project brief
Prerequisites
- Module 1 (Storage Engine Fundamentals) — page I/O concepts
- Module 2 (B-Tree Index Structures) — understanding of B+ tree tradeoffs (to compare against)
What Comes Next
Module 4 (Write-Ahead Logging & Recovery) adds durability to the LSM engine. Currently, a crash loses all data in the memtable (which is in-memory only). The WAL ensures that every write is persisted before being acknowledged, and the recovery process replays the WAL to rebuild the memtable after a crash.
Lesson 1 — Memtables and Sorted String Tables
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 1
Source note: This lesson was synthesized from training knowledge and the Mini-LSM course structure. Verify Petrov's specific SSTable format description and Kleppmann's LSM compaction cost analysis against the source texts.
Context
The B+ tree from Module 2 provides O(log N) reads but suffers under write-heavy workloads: every insert modifies a leaf page in place, and node splits amplify writes further. For the Orbital Object Registry's burst ingestion scenario (12,000 new records in under 60 seconds), the B+ tree's write amplification of 10–20x is the bottleneck.
The Log-Structured Merge Tree (LSM) eliminates in-place updates entirely. All writes go to an in-memory sorted structure called a memtable. When the memtable reaches a size threshold, it is frozen (becomes immutable) and flushed to disk as a Sorted String Table (SSTable) — a file of sorted key-value pairs that is never modified after creation. Reads probe the memtable first, then search SSTables from newest to oldest.
This architecture makes writes trivially fast: inserting a key-value pair is a single in-memory operation (an insert into a skip list or B-tree in RAM). The cost is shifted to reads, which must check multiple SSTables, and to background compaction, which merges SSTables to keep read amplification bounded. The entire LSM design is a bet that write throughput matters more than read latency for many workloads — and for TLE burst ingestion, it does.
Core Concepts
The Memtable
The memtable is a mutable, in-memory sorted data structure that buffers incoming writes. Common implementations:
- Skip list (used by LevelDB, RocksDB): O(log N) insert and lookup, lock-free concurrent reads, good cache behavior. The standard choice.
- Red-black tree or B-tree in memory: O(log N) operations, but harder to make lock-free.
- Sorted vector: O(N) insert (shift), O(log N) lookup. Only viable for very small memtables.
For the OOR, a BTreeMap<Vec<u8>, Vec<u8>> is the simplest correct implementation. Production engines use skip lists for concurrent access, but the algorithm is the same: maintain sorted order in memory, flush to disk when full.
Key operations:
- Put(key, value): Insert or update a key in the memtable. O(log N).
- Delete(key): Insert a tombstone — a special marker that indicates the key has been deleted. The tombstone must persist through SSTables so that older versions of the key are masked.
- Get(key): Look up a key. Returns the value, or the tombstone if deleted, or None if the key is not in the memtable.
The tombstone design is critical: a delete cannot simply remove the key from the memtable, because older SSTables on disk may still contain the key. Without a tombstone, a read would miss the memtable (key not present), then find the old value in an SSTable and incorrectly return it.
Memtable Freeze and Flush
When the memtable reaches its size threshold (typically 4–64MB), the engine:
- Freezes the current memtable — it becomes immutable (no more writes).
- Creates a new active memtable for incoming writes.
- In the background, flushes the immutable memtable to disk as an SSTable.
The freeze-then-flush pattern ensures writes are never blocked by disk I/O. The only latency-sensitive operation is the in-memory insert. Multiple immutable memtables can exist simultaneously (queued for flush), but each consumes memory, so the engine must flush faster than new memtables are created.
Write path:
Put(key, value)
│
▼
Active Memtable (mutable, in-memory)
│ (size threshold reached)
▼
Immutable Memtable (frozen, in-memory)
│ (background flush)
▼
SSTable on disk (sorted, immutable)
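The freeze step above can be sketched as a simple rotation. This is a hypothetical single-threaded model — a real engine swaps the memtable under a lock or atomic pointer so writers never wait on disk, and a background thread drains the immutable queue to SSTables:

```rust
use std::collections::{BTreeMap, VecDeque};

struct Engine {
    active: BTreeMap<Vec<u8>, Vec<u8>>,
    active_bytes: usize,
    size_limit: usize,
    // Frozen memtables queued for background flush, oldest at the front.
    immutables: VecDeque<BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl Engine {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.active_bytes += key.len() + value.len();
        self.active.insert(key, value);
        if self.active_bytes >= self.size_limit {
            // Freeze: move the full memtable aside, start a fresh one.
            // A flush worker would later pop from `immutables` and write
            // each frozen table out as an SSTable.
            let frozen = std::mem::take(&mut self.active);
            self.immutables.push_back(frozen);
            self.active_bytes = 0;
        }
    }
}
```

Note the write itself never touches disk — the freeze is a pointer swap (`mem::take`), which is why put latency stays at in-memory cost even when a flush is pending.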
SSTable Format
An SSTable is a file containing sorted key-value pairs organized into data blocks. The standard layout:
┌─────────────────────────────────────────────┐
│ Data Block 0: [k0:v0, k1:v1, k2:v2, ...] │
│ Data Block 1: [k3:v3, k4:v4, k5:v5, ...] │
│ ... │
│ Data Block N: [kM:vM, ...] │
├─────────────────────────────────────────────┤
│ Meta Block: Bloom filter (optional) │
├─────────────────────────────────────────────┤
│ Index Block: [block_0_last_key → offset, │
│ block_1_last_key → offset, │
│ ... ] │
├─────────────────────────────────────────────┤
│ Footer: index_offset, meta_offset, magic │
└─────────────────────────────────────────────┘
Data blocks (typically 4KB each) contain sorted key-value pairs. Keys within a block can use prefix compression — store the shared prefix once and encode each key as a delta.
Index block maps the last key of each data block to the block's file offset. A point lookup binary-searches the index block to find which data block might contain the key, then searches within that block.
Meta block stores auxiliary data — most importantly, a bloom filter for fast negative lookups (Lesson 3).
Footer is the last few bytes of the file, containing offsets to the index and meta blocks. The reader starts by reading the footer, then uses its offsets to locate everything else.
SSTables are immutable — once written, they are never modified. Updates and deletes are handled by writing new SSTables that supersede older ones. This immutability is the source of LSM's concurrency simplicity: readers can access any SSTable without locks (the file doesn't change under them), and the only coordination needed is between the flush/compaction writers and the metadata that tracks which SSTables are active.
The Read Path
To read a key from an LSM engine:
- Check the active memtable. If found, return immediately.
- Check each immutable memtable from newest to oldest.
- Check each SSTable from newest to oldest (L0, then L1, then L2, ...).
- If the key is not found anywhere, it does not exist.
At each level, finding a tombstone means the key was deleted — stop searching and return "not found." This is why tombstones must be ordered correctly: a tombstone in a newer SSTable masks the key's value in all older SSTables.
The worst case is a negative lookup (key doesn't exist): the engine must check every memtable and every SSTable before concluding the key is absent. This is where bloom filters (Lesson 3) provide the biggest win — they let the engine skip entire SSTables in O(1) per filter check.
The Merge Iterator
Range scans (and compaction) require merging sorted streams from multiple sources — the memtable and several SSTables. The merge iterator (also called a multi-way merge) takes N sorted iterators and produces a single sorted stream:
- Maintain a min-heap of the current key from each source.
- Pop the smallest key. If multiple sources have the same key, take the one from the newest source (the memtable or the most recent SSTable).
- Advance the source that produced the popped key.
The newest-wins rule is what makes updates and deletes work correctly: a newer value for the same key supersedes the older one, and a newer tombstone masks the older value.
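The three rules above can be sketched with a binary heap. This hypothetical version merges in-memory vectors (index 0 = newest source); ordering heap entries by `(key, source_index)` makes the newest source pop first for equal keys, so a simple "skip keys already emitted" check implements newest-wins:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge N sorted sources into one sorted stream with newest-wins
/// semantics. `sources[0]` is the newest (the memtable); higher
/// indices are progressively older SSTables.
fn merge(sources: Vec<Vec<(u32, &str)>>) -> Vec<(u32, &str)> {
    // Heap entries: Reverse((key, source_index, position_in_source)).
    // For equal keys, the smaller source index — the newer source — pops first.
    let mut heap = BinaryHeap::new();
    for (src, items) in sources.iter().enumerate() {
        if !items.is_empty() {
            heap.push(Reverse((items[0].0, src, 0usize)));
        }
    }
    let mut out: Vec<(u32, &str)> = Vec::new();
    while let Some(Reverse((key, src, pos))) = heap.pop() {
        // If this key was already emitted, it came from a newer source — skip.
        if out.last().map(|&(k, _)| k) != Some(key) {
            out.push((key, sources[src][pos].1));
        }
        // Advance the source that produced the popped entry.
        if pos + 1 < sources[src].len() {
            heap.push(Reverse((sources[src][pos + 1].0, src, pos + 1)));
        }
    }
    out
}
```

In a real engine the same shape runs over streaming SSTable iterators rather than vectors, and tombstones flow through as values so compaction can apply its garbage-collection rule.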
Code Examples
A Simple Memtable Backed by BTreeMap
The memtable stores key-value pairs in sorted order. Tombstones are represented as None values.
use std::collections::BTreeMap;
/// Memtable: in-memory sorted store for LSM writes.
/// Keys are byte slices. Values are `Option<Vec<u8>>` where
/// `None` represents a tombstone (deleted key).
struct MemTable {
map: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
size_bytes: usize,
size_limit: usize,
}
impl MemTable {
fn new(size_limit: usize) -> Self {
Self {
map: BTreeMap::new(),
size_bytes: 0,
size_limit,
}
}
/// Insert or update a key-value pair.
fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
// Overwriting an existing key adds its full size again, so size_bytes
// is a conservative overestimate — the flush merely triggers a little
// early. Acceptable for a size heuristic.
self.size_bytes += key.len() + value.len();
self.map.insert(key, Some(value));
}
/// Mark a key as deleted (insert a tombstone).
fn delete(&mut self, key: Vec<u8>) {
let entry_size = key.len();
self.size_bytes += entry_size;
self.map.insert(key, None); // None = tombstone
}
/// Look up a key. Returns:
/// - Some(Some(value)) if the key exists
/// - Some(None) if the key is tombstoned (deleted)
/// - None if the key is not in this memtable
fn get(&self, key: &[u8]) -> Option<&Option<Vec<u8>>> {
self.map.get(key)
}
/// True if the memtable has reached its size limit and should be frozen.
fn should_flush(&self) -> bool {
self.size_bytes >= self.size_limit
}
/// Iterate all entries in sorted order for flushing to an SSTable.
fn iter(&self) -> impl Iterator<Item = (&Vec<u8>, &Option<Vec<u8>>)> {
self.map.iter()
}
}
The three-valued return from get is essential: None means "this memtable has no information about this key — keep searching older sources." Some(None) means "this key was deleted — stop searching." Some(Some(value)) means "here's the value." Collapsing the first two cases would make deleted keys reappear from older SSTables.
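The same three-valued convention drives the engine-level read path: probe sources from newest to oldest and let each answer "value", "tombstone — stop", or "no information — keep going". A simplified sketch, with every source modeled as a `BTreeMap` (a real engine would probe SSTables, not maps, for the older sources):

```rust
use std::collections::BTreeMap;

/// Engine-level get over sources ordered newest to oldest.
/// Each source uses the three-valued convention:
///   Some(Some(v)) = live value, Some(None) = tombstone, None = not here.
fn engine_get<'a>(
    sources_newest_first: &[&'a BTreeMap<Vec<u8>, Option<Vec<u8>>>],
    key: &[u8],
) -> Option<&'a Vec<u8>> {
    for source in sources_newest_first {
        match source.get(key) {
            Some(Some(value)) => return Some(value), // live value found
            Some(None) => return None,               // tombstone masks older sources
            None => continue,                        // no information — check older
        }
    }
    None // not found in any source
}
```

The early return on `Some(None)` is the whole point: without it, the loop would fall through to an older source and resurrect a deleted key.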
SSTable Builder: Flushing the Memtable to Disk
When the memtable is frozen, its entries are written to an SSTable file in sorted order.
use std::io::{self, Write, BufWriter};
use std::fs::File;
const BLOCK_SIZE: usize = 4096;
/// Builds an SSTable file from sorted key-value pairs.
struct SsTableBuilder {
writer: BufWriter<File>,
/// Index entries: (last_key_in_block, block_offset)
index: Vec<(Vec<u8>, u64)>,
current_block: Vec<u8>,
current_block_offset: u64,
/// Last key added — needed to index a partially filled final block.
last_key: Vec<u8>,
entry_count: usize,
}
impl SsTableBuilder {
fn new(path: &str) -> io::Result<Self> {
let file = File::create(path)?;
Ok(Self {
writer: BufWriter::new(file),
index: Vec::new(),
current_block: Vec::new(),
current_block_offset: 0,
last_key: Vec::new(),
entry_count: 0,
})
}
/// Add a key-value pair. Keys must be added in sorted order.
fn add(&mut self, key: &[u8], value: Option<&[u8]>) -> io::Result<()> {
// Encode the entry: [key_len: u16][key][is_tombstone: u8][value_len: u16][value]
let is_tombstone = value.is_none();
let val = value.unwrap_or(&[]);
self.current_block.extend_from_slice(&(key.len() as u16).to_le_bytes());
self.current_block.extend_from_slice(key);
self.current_block.push(if is_tombstone { 1 } else { 0 });
self.current_block.extend_from_slice(&(val.len() as u16).to_le_bytes());
self.current_block.extend_from_slice(val);
self.last_key = key.to_vec();
self.entry_count += 1;
// If the block is full, flush it
if self.current_block.len() >= BLOCK_SIZE {
self.flush_block()?;
}
Ok(())
}
fn flush_block(&mut self) -> io::Result<()> {
let offset = self.current_block_offset;
self.writer.write_all(&self.current_block)?;
// The most recently added key is the last key of this block.
self.index.push((self.last_key.clone(), offset));
self.current_block_offset += self.current_block.len() as u64;
self.current_block.clear();
Ok(())
}
/// Finalize the SSTable: write the index block and footer.
fn finish(mut self) -> io::Result<()> {
// Flush any remaining data in the current (partially filled) block
if !self.current_block.is_empty() {
self.flush_block()?;
}
// Write the index block
let index_offset = self.current_block_offset;
for (key, offset) in &self.index {
self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
self.writer.write_all(key)?;
self.writer.write_all(&offset.to_le_bytes())?;
}
// Write the footer: index_offset + magic
self.writer.write_all(&index_offset.to_le_bytes())?;
self.writer.write_all(b"OORSST01")?; // 8-byte magic
self.writer.flush()?;
Ok(())
}
}
The builder writes entries into fixed-size blocks and records the last key and offset of each block in the index. The footer at the end of the file lets the reader locate the index without scanning the entire file. This is the same layout used by LevelDB and RocksDB's table format (with more compression and filtering in production).
Notice that add requires keys in sorted order — the caller (the memtable flush code) is responsible for iterating the memtable in order. Violating this invariant produces a corrupt SSTable where binary search returns wrong results.
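The reader's side starts from the footer. A hypothetical reader for the layout produced above, operating on an in-memory byte buffer for clarity (a real reader would seek within the file), parses the footer to find the index, then binary-searches the index on each block's last key:

```rust
/// Parse the index from an SSTable image whose footer is
/// [index_offset: u64 LE][8-byte magic "OORSST01"].
/// Index entries: [key_len: u16 LE][key][block_offset: u64 LE], repeated.
fn read_index(data: &[u8]) -> Option<Vec<(Vec<u8>, u64)>> {
    if data.len() < 16 || &data[data.len() - 8..] != b"OORSST01" {
        return None; // missing or corrupt footer
    }
    let off: [u8; 8] = data[data.len() - 16..data.len() - 8].try_into().ok()?;
    let index_offset = u64::from_le_bytes(off) as usize;
    let mut entries = Vec::new();
    let mut pos = index_offset;
    while pos < data.len() - 16 {
        let key_len = u16::from_le_bytes(data[pos..pos + 2].try_into().ok()?) as usize;
        pos += 2;
        let key = data[pos..pos + key_len].to_vec();
        pos += key_len;
        let block_offset = u64::from_le_bytes(data[pos..pos + 8].try_into().ok()?);
        pos += 8;
        entries.push((key, block_offset));
    }
    Some(entries)
}

/// Point lookup step 1: the first index entry whose last key is >= the
/// target identifies the only data block that can contain the key.
fn find_block(index: &[(Vec<u8>, u64)], key: &[u8]) -> Option<u64> {
    let i = index.partition_point(|(last_key, _)| last_key.as_slice() < key);
    index.get(i).map(|&(_, offset)| offset)
}
```

A `None` from `find_block` means the target key is greater than the file's last key — the SSTable can be skipped without reading any data block.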
Key Takeaways
- The LSM write path is: active memtable → freeze → immutable memtable → background flush → SSTable on disk. Writes are never blocked by disk I/O — they complete as soon as the in-memory insert finishes.
- Deletes are tombstones, not removals. A tombstone must persist through SSTables to mask older versions of the key. Compaction eventually garbage-collects tombstones once no older version exists.
- SSTables are immutable sorted files partitioned into data blocks with an index block for O(log B) block lookup (where B is the number of blocks). Immutability enables lock-free concurrent reads.
- The read path checks memtable first, then SSTables from newest to oldest. Negative lookups (key doesn't exist) are the worst case — they must check every source. Bloom filters (Lesson 3) mitigate this.
- The merge iterator produces a single sorted stream from multiple sources, with newest-wins semantics for duplicate keys. This is the core data structure for both reads and compaction.
Lesson 2 — Compaction Strategies
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 7; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 3; Mini-LSM Week 2
Source note: This lesson was synthesized from training knowledge and the Mini-LSM course compaction chapters. Verify the specific amplification formulas against Petrov Chapter 7 and the RocksDB Tuning Guide.
Context
Without compaction, the LSM engine accumulates SSTables indefinitely. Every flush creates a new SSTable in Level 0 (L0). After 100 flushes, there are 100 L0 SSTables — and a point lookup must check all of them. Read amplification grows linearly with the number of SSTables. Space amplification grows too: deleted keys still consume space in older SSTables, and updated keys have multiple versions.
Compaction is the background process that merges SSTables to reduce read and space amplification. It reads multiple SSTables, merge-sorts their entries (applying tombstones and keeping only the newest version of each key), and writes the result as fewer, larger SSTables. The merged input files are then deleted.
The compaction strategy determines which SSTables to merge and when. Different strategies make different tradeoffs between three amplification factors:
- Read amplification (RA): How many SSTables must be checked for a single read. Lower RA = faster reads.
- Write amplification (WA): How many times each byte of user data is written to disk over its lifetime. Lower WA = faster ingestion.
- Space amplification (SA): How much extra disk space is used beyond the logical data size. Lower SA = less storage cost.
No strategy minimizes all three simultaneously — this is the fundamental tradeoff of LSM design. Choosing a strategy means choosing which factor to sacrifice for the workload at hand.
Core Concepts
Level 0 and the Flush Problem
When a memtable is flushed, it becomes an L0 SSTable. L0 is special: its SSTables have overlapping key ranges (each SSTable contains whatever keys were in the memtable at freeze time, which can be any subset of the key space). This means a point lookup at L0 must check every L0 SSTable — there's no way to narrow the search by key range.
All compaction strategies share a common first step: merge L0 SSTables into L1, where SSTables have non-overlapping key ranges. In L1 and below, a point lookup can determine which SSTable(s) to check based on key range alone (binary search on SSTable boundaries), reducing read amplification.
L0: [a-z] [b-y] [c-x] ← overlapping, must check all
L1: [a-f] [g-m] [n-z] ← non-overlapping, check at most 1
L2: [a-c] [d-f] [g-i] ...← non-overlapping, check at most 1
Leveled Compaction
Leveled compaction (default in RocksDB, used by LevelDB) organizes SSTables into levels with exponentially increasing size targets. Each level's total size is a fixed multiple (the size ratio, typically 10) of the level above:
- L1 target: 256MB
- L2 target: 2.56GB (10 × L1)
- L3 target: 25.6GB (10 × L2)
When a level exceeds its target size, the engine picks one SSTable from that level and merges it with all overlapping SSTables in the next level.
Key property: Within each level (L1+), SSTables have non-overlapping key ranges. This means at most one SSTable per level needs to be checked for a point lookup.
Amplification characteristics:
- Read amplification: O(L) where L is the number of levels. With size ratio 10, a 1TB database has ~4 levels → ~4 SSTable reads per lookup. Excellent.
- Write amplification: High — each byte of data is rewritten roughly size-ratio (~10) times per level it descends through. With size ratio 10 and 4 levels, write amplification is approximately 10 × L = 40x.
- Space amplification: Low — each key has at most one copy per level, and compaction removes obsolete versions. Typically 1.1–1.2x.
Leveled compaction is optimal for read-heavy workloads with limited tolerance for space overhead — exactly the OOR's conjunction query workload after the initial bulk ingestion.
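As a rough back-of-envelope check of those figures, a hypothetical estimator under the stated assumptions (L1 target, fixed size ratio, worst-case WA ≈ ratio × levels) — this is a cumulative-capacity model, not RocksDB's exact level sizing:

```rust
/// Estimate (level count, worst-case write amplification) for leveled
/// compaction, counting levels until cumulative capacity covers the
/// database size. Assumes each level is `ratio` times the one above.
fn leveled_estimate(db_bytes: u64, l1_bytes: u64, ratio: u64) -> (u32, u64) {
    let mut levels: u32 = 1;
    let mut capacity = l1_bytes;   // total capacity of all levels so far
    let mut level_size = l1_bytes; // capacity of the current deepest level
    while capacity < db_bytes {
        level_size *= ratio;
        capacity += level_size;
        levels += 1;
    }
    // Worst case: ~ratio rewrites per level hop, over all levels.
    (levels, ratio * levels as u64)
}
```

With L1 = 256MB and ratio 10, a 100GB database lands at 4 levels and a worst-case write amplification around 40x, matching the 10 × L figure above.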
Tiered (Universal) Compaction
Tiered compaction (RocksDB's "Universal" mode, used by Cassandra) groups SSTables into tiers (sorted runs) of similar size. When the number of tiers at a size level reaches a threshold, they are merged into a single larger tier.
Key property: Each tier is a sorted run (non-overlapping internally), but different tiers can overlap with each other. A read must check one SSTable per tier.
Amplification characteristics:
- Read amplification: O(T) where T is the number of tiers. Worse than leveled because tiers overlap.
- Write amplification: Low — each compaction merges tiers of similar size, so each byte is rewritten fewer times. Typically 2–5x for well-configured tiering.
- Space amplification: High — during compaction, both the input tiers and the output tier exist simultaneously, requiring up to 2x the logical data size in temporary space. Permanent space amplification is also higher because multiple tiers may hold different versions of the same key.
Tiered compaction is optimal for write-heavy workloads where ingestion throughput matters more than read latency — exactly the OOR's burst ingestion during a fragmentation event.
FIFO Compaction
FIFO compaction simply deletes the oldest SSTable when total storage exceeds a threshold. No merging occurs. This is appropriate for time-series data with a natural expiration window — if the OOR only needs TLE records from the last 7 days, FIFO compaction automatically ages out older data.
Amplification characteristics:
- Read amplification: High — all SSTables persist until aged out.
- Write amplification: 1x — data is written once (the initial flush) and never rewritten.
- Space amplification: Bounded by the retention window.
FIFO is unsuitable for the OOR's catalog (which must retain all active objects indefinitely) but useful for the telemetry stream (which has natural time-based expiration).
Amplification Tradeoff Summary
| Strategy | Read Amp | Write Amp | Space Amp | Best For |
|---|---|---|---|---|
| Leveled | Low (O(L)) | High (~10L) | Low (~1.1x) | Read-heavy, space-sensitive |
| Tiered | Medium (O(T)) | Low (~2-5x) | High (~2x) | Write-heavy, burst ingestion |
| FIFO | High | Minimal (1x) | Time-bounded | TTL data, append-only streams |
The RocksDB wiki summarizes the tradeoff space clearly: it is generally impossible to minimize all three amplification factors simultaneously. The compaction strategy is a knob that slides between them.
Compaction Mechanics: The Merge Process
Regardless of strategy, the actual compaction operation is the same:
1. Select input SSTables (determined by the strategy).
2. Create a merge iterator over all input SSTables.
3. For each key in sorted order:
   - If multiple versions exist, keep only the newest.
   - If the newest version is a tombstone and there are no older SSTables that might contain the key, drop the tombstone (garbage collection).
   - Otherwise, write the entry to the output SSTable(s).
4. Split output into new SSTables when they reach the target file size.
5. Atomically update the LSM metadata to swap old SSTables for new ones.
6. Delete the old input SSTables.
Step 5 is critical for crash safety — if the engine crashes during compaction, it must not lose data. Either the old SSTables or the new ones should be the active set, never a mix. This is typically handled by writing a manifest (metadata log) that records which SSTables are active, and updating it atomically (via rename or WAL).
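The merge itself — newest-wins plus tombstone garbage collection — can be sketched compactly. This hypothetical version collects into a `BTreeMap` instead of streaming through a merge iterator, and models tombstones as `None`; `is_bottom_level` stands in for "no older SSTable outside this compaction might contain the key":

```rust
use std::collections::BTreeMap;

/// Merge input runs (ordered newest-first) into one sorted run.
/// Values are Option<_>: None is a tombstone. Tombstones are dropped
/// only when compacting into the bottom level, where nothing older
/// could still hold a masked version of the key.
fn compact<'a>(
    inputs_newest_first: &[Vec<(u32, Option<&'a str>)>],
    is_bottom_level: bool,
) -> Vec<(u32, Option<&'a str>)> {
    let mut merged: BTreeMap<u32, Option<&str>> = BTreeMap::new();
    // Insert older sources first so newer entries overwrite (newest-wins).
    for source in inputs_newest_first.iter().rev() {
        for &(key, value) in source {
            merged.insert(key, value);
        }
    }
    merged
        .into_iter()
        // Garbage-collect tombstones only at the bottom level.
        .filter(|&(_, v)| !(is_bottom_level && v.is_none()))
        .collect()
}
```

Note how a tombstone survives a non-bottom compaction: dropping it early would let the key's old value reappear from a deeper level on the next read.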
Compaction Scheduling
Compaction runs in background threads and must be scheduled carefully:
- Too little compaction: Read amplification grows as uncompacted SSTables accumulate.
- Too much compaction: Write bandwidth is consumed by compaction, starving foreground writes.
- Compaction during peak load: The background I/O from compaction interferes with foreground query latency.
Production engines use rate limiting (RocksDB's rate_limiter) to cap compaction I/O bandwidth, and priority scheduling to defer compaction during high-load periods. The SILK paper (USENIX ATC '19) formalized this as a latency-aware compaction scheduler.
For the OOR, the practical guideline: run compaction aggressively during quiet periods (between pass windows) and throttle during conjunction query bursts.
Code Examples
A Simple Leveled Compaction Controller
This determines which SSTables to compact and when, based on level size targets.
/// Metadata for an SSTable in the LSM state.
#[derive(Debug, Clone)]
struct SstMeta {
id: u64,
level: usize,
size_bytes: u64,
min_key: Vec<u8>,
max_key: Vec<u8>,
}
/// LSM state: tracks all active SSTables by level.
struct LsmState {
/// L0 SSTables (overlapping key ranges, newest first)
l0_sstables: Vec<SstMeta>,
/// L1+ levels: each level is a vec of non-overlapping SSTables sorted by key range
levels: Vec<Vec<SstMeta>>,
/// Size ratio between adjacent levels (typically 10)
size_ratio: u64,
/// L1 target size in bytes
l1_target_bytes: u64,
}
/// A compaction task: which SSTables to merge and where to put the output.
struct CompactionTask {
input_ssts: Vec<SstMeta>,
output_level: usize,
}
impl LsmState {
/// Determine if compaction is needed and generate a task.
fn generate_compaction_task(&self) -> Option<CompactionTask> {
// Priority 1: Too many L0 SSTables (merge all L0 into L1)
if self.l0_sstables.len() >= 4 {
let mut inputs: Vec<SstMeta> = self.l0_sstables.clone();
// Include all L1 SSTables that overlap with any L0 SSTable
let l0_min = inputs.iter().map(|s| &s.min_key).min().unwrap().clone();
let l0_max = inputs.iter().map(|s| &s.max_key).max().unwrap().clone();
if let Some(l1) = self.levels.get(0) {
for sst in l1 {
if sst.max_key >= l0_min && sst.min_key <= l0_max {
inputs.push(sst.clone());
}
}
}
return Some(CompactionTask {
input_ssts: inputs,
output_level: 1,
});
}
// Priority 2: A level exceeds its target size
for (i, level) in self.levels.iter().enumerate() {
let level_num = i + 1; // levels[0] = L1
let target = self.l1_target_bytes * self.size_ratio.pow(i as u32);
let actual: u64 = level.iter().map(|s| s.size_bytes).sum();
if actual > target {
// Pick the SSTable with the most overlap in the next level
// (simplified: pick the first SSTable)
if let Some(sst) = level.first() {
let mut inputs = vec![sst.clone()];
// Add overlapping SSTables from the next level
if let Some(next_level) = self.levels.get(i + 1) {
for next_sst in next_level {
if next_sst.max_key >= sst.min_key
&& next_sst.min_key <= sst.max_key
{
inputs.push(next_sst.clone());
}
}
}
return Some(CompactionTask {
input_ssts: inputs,
output_level: level_num + 1,
});
}
}
}
None // No compaction needed
}
}
The L0-to-L1 compaction merges all L0 SSTables with the overlapping portion of L1. This is the most expensive compaction operation (L0 SSTables overlap each other, so the entire key range may be involved), but it's necessary to establish the non-overlapping property at L1.
The level-to-level compaction picks a single SSTable from the overfull level and merges it with the overlapping SSTables in the next level. Production implementations (RocksDB) cycle through SSTables in key order to ensure uniform compaction across the key space, preventing hot spots.
Key Takeaways
- Compaction is the LSM engine's background maintenance process — it merges SSTables to reduce read amplification and space amplification at the cost of write amplification.
- The three amplification factors (read, write, space) are fundamentally in tension. No compaction strategy minimizes all three. Leveled compaction favors reads; tiered favors writes; FIFO favors write throughput for TTL data.
- Leveled compaction organizes SSTables into levels with non-overlapping key ranges and exponentially increasing size targets. Write amplification of ~10x per level is the cost of O(L) read amplification.
- Tiered compaction groups SSTables into sorted runs of similar size. Write amplification of 2-5x is the reward for tolerating higher read and space amplification.
- L0 is special: SSTables have overlapping key ranges and must all be checked on every read. Flushing L0 to L1 is the highest-priority compaction task.
- Compaction scheduling must balance background I/O against foreground query latency. Throttle during peak load, compact aggressively during quiet periods.
Lesson 3 — Bloom Filters, Block Cache, and Read Optimization
Module: Database Internals — M03: LSM Trees & Compaction
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapters 7–8; Mini-LSM Week 1 Day 7
Source note: This lesson was synthesized from training knowledge. Verify the bloom filter bits-per-key formula and Petrov's block cache eviction policies against the source texts.
Context
The LSM read path from Lesson 1 checks every SSTable from newest to oldest. For a database with 50 SSTables, a negative lookup (key doesn't exist) requires 50 SSTable probes — each involving reading an index block and potentially a data block from disk. Even with leveled compaction limiting the effective probe count to one SSTable per level, a 4-level LSM still reads 4 SSTables per negative lookup.
Two optimizations close the read performance gap between LSM trees and B+ trees:
- Bloom filters — a probabilistic data structure attached to each SSTable that answers "is this key possibly in this SSTable?" in O(1) with no disk I/O. A "no" answer is definitive (the key is definitely not there), so the engine skips the entire SSTable. With a 1% false positive rate, 99% of unnecessary SSTable probes are eliminated.
- Block cache — an in-memory cache of recently-read SSTable data blocks and index blocks. Hot blocks stay in memory, eliminating disk reads for frequently-accessed keys. Combined with index block pinning (keeping all index blocks in memory permanently), this makes most SSTable lookups a single data block read.
Together, these two techniques reduce the practical I/O cost of an LSM point lookup to approximately 1 disk read — competitive with the B+ tree.
Core Concepts
Bloom Filters
A bloom filter is a bit array of m bits with k independent hash functions. To add a key, hash it with all k functions and set the corresponding bits. To check a key, hash it and verify all k bits are set. If any bit is 0, the key is definitely not in the set. If all bits are 1, the key is probably in the set (but might be a false positive).
The false positive probability for a bloom filter with m bits, k hash functions, and n inserted keys is approximately:
FPR ≈ (1 - e^(-kn/m))^k
The optimal number of hash functions for a given m/n (bits per key) ratio is k = (m/n) × ln(2).
Practical sizing for the OOR:
| Bits per key | False positive rate | Memory per 10,000 keys |
|---|---|---|
| 5 | 9.2% | 6.1 KB |
| 10 | 0.82% | 12.2 KB |
| 14 | 0.12% | 17.1 KB |
| 20 | 0.007% | 24.4 KB |
10 bits per key is the standard choice (RocksDB default) — it provides a ~1% false positive rate at modest memory cost. For the OOR's 100,000 keys, the bloom filter per SSTable adds ~122 KB, trivially small compared to the SSTable data itself.
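These sizing numbers can be checked directly against the FPR formula above. A quick sketch (the hash count k is rounded to the nearest integer, since a real filter must use an integral number of hash functions):

```rust
/// False positive rate for a bloom filter using `bits_per_key` bits per key,
/// with the hash count k rounded to the nearest integer.
fn bloom_fpr(bits_per_key: f64) -> f64 {
    // Optimal k = (m/n) * ln(2), rounded to an integer and kept >= 1.
    let k = (bits_per_key * std::f64::consts::LN_2).round().max(1.0);
    // FPR ~ (1 - e^(-k*n/m))^k, with m/n = bits_per_key.
    (1.0 - (-k / bits_per_key).exp()).powf(k)
}
```

For example, `bloom_fpr(10.0)` evaluates to roughly 0.0082, i.e. the ~0.82% figure in the 10-bits-per-key row.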
A bloom filter is stored in the SSTable's meta block and loaded into memory when the SSTable is opened. It is never written to disk again — it's read-only once created. This means the filter is computed during the SSTable build (compaction or flush) and persists for the SSTable's lifetime.
Bloom Filter in the Read Path
When the engine needs to check an SSTable for a key:
- Consult the in-memory bloom filter for that SSTable.
- If the filter says "no" → skip the SSTable entirely. Zero disk I/O.
- If the filter says "maybe" → read the index block, find the candidate data block, read the data block, and search for the key.
For a negative lookup across 50 SSTables with a 1% FPR:
- Without bloom filters: 50 SSTable probes (50 index reads + up to 50 data reads).
- With bloom filters: ~0.5 SSTable probes on average (50 × 0.01 = 0.5 false positives).
This transforms the LSM's worst-case operation into a near-best-case: negative lookups, which previously cost a disk probe per SSTable, drop to ~0.5 disk reads on average.
Block Cache
The block cache is an LRU cache (similar to the buffer pool from Module 1) that stores recently-accessed SSTable blocks in memory. Unlike the buffer pool, which caches full pages, the block cache stores individual SSTable blocks (typically 4KB) keyed by (sst_id, block_offset).
Two categories of blocks are cached:
- Data blocks — the actual key-value data. Cached on demand (when a read hits the block).
- Index blocks — the SSTable's internal index mapping last-key-per-block to block offset. Frequently accessed (every point lookup into the SSTable reads the index block first).
Many engines pin index blocks and filter blocks in the cache — they are loaded when the SSTable is opened and never evicted. This guarantees that every SSTable lookup requires at most one disk read (for the data block), because the filter and index are always in memory.
The block cache sits above the OS page cache and provides the engine with workload-aware caching. Like the buffer pool, it exists because the OS page cache doesn't understand SSTable access patterns — it can't distinguish between a hot data block and a cold compaction input.
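A minimal sketch of such a cache, keyed by (sst_id, block_offset), with pinned entries exempt from eviction. The names and the simple eviction policy here are illustrative, not any production engine's implementation:

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal LRU block cache keyed by (sst_id, block_offset).
/// Pinned blocks (index/filter blocks) are never placed on the LRU list,
/// so they are never evicted.
struct BlockCache {
    capacity: usize,
    map: HashMap<(u64, u64), (Vec<u8>, bool)>, // (block bytes, pinned)
    lru: VecDeque<(u64, u64)>,                 // least-recently-used at the front
}

impl BlockCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), lru: VecDeque::new() }
    }

    fn insert(&mut self, key: (u64, u64), block: Vec<u8>, pinned: bool) {
        // Evict unpinned blocks until there is room. If everything left is
        // pinned, the cache is allowed to overflow here; a real cache would
        // account for pinned memory separately.
        while self.map.len() >= self.capacity {
            match self.lru.pop_front() {
                Some(victim) => { self.map.remove(&victim); }
                None => break,
            }
        }
        if !pinned {
            self.lru.push_back(key);
        }
        self.map.insert(key, (block, pinned));
    }

    fn get(&mut self, key: &(u64, u64)) -> Option<&Vec<u8>> {
        // Touch: move an unpinned entry to the back (most recently used).
        if let Some(pos) = self.lru.iter().position(|k| k == key) {
            let k = self.lru.remove(pos).unwrap();
            self.lru.push_back(k);
        }
        self.map.get(key).map(|(block, _)| block)
    }
}
```

Inserting a third block into a two-slot cache evicts the oldest unpinned block, while a pinned index block survives indefinitely.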
Prefix Bloom Filters
Standard bloom filters answer "is this exact key in the SSTable?" For range scans, you need a different question: "does this SSTable contain any keys with this prefix?" A prefix bloom filter hashes key prefixes instead of full keys, enabling prefix-based filtering.
For the OOR, a prefix bloom on the first 3 bytes of the NORAD ID would let range scans skip SSTables that don't contain any keys in the target range. The false positive rate is higher (more keys share a prefix than match exactly), but the I/O savings for range scans are significant.
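A sketch of the idea, using a HashSet as a stand-in for the filter's bit array (the may-contain semantics, i.e. no false negatives, are the same; a real implementation would hash the prefixes into a bloom filter instead):

```rust
use std::collections::HashSet;

/// Prefix filter sketch: index 3-byte key prefixes instead of full keys.
/// A HashSet stands in for the bloom filter's bit array.
struct PrefixFilter {
    prefixes: HashSet<Vec<u8>>,
}

impl PrefixFilter {
    /// Built once during SSTable creation, like a normal bloom filter.
    fn build<'a>(keys: impl Iterator<Item = &'a [u8]>) -> Self {
        let prefixes = keys.map(|k| k[..k.len().min(3)].to_vec()).collect();
        Self { prefixes }
    }

    /// Can this SSTable contain any key sharing this key's 3-byte prefix?
    fn may_contain_prefix(&self, key: &[u8]) -> bool {
        self.prefixes.contains(&key[..key.len().min(3)].to_vec())
    }
}
```

A range scan over [40000, 40500] would consult `may_contain_prefix` for the scan bounds and skip SSTables whose filters reject every prefix in the range.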
Combining Optimizations: End-to-End Read Path
A fully optimized LSM point lookup:
- Check active memtable (in-memory, O(log N)).
- Check immutable memtables (in-memory, O(log N) each).
- For each SSTable, newest to oldest:
  a. Check the bloom filter (in-memory, O(k) hash operations). If negative → skip.
  b. Read the index block (in block cache → 0 disk I/O if pinned).
  c. Binary search the index block for the target data block.
  d. Read the data block (in block cache → 0 I/O if hot; 1 disk read if cold).
  e. Search the data block for the key.
- First match (value or tombstone) terminates the search.
For a positive lookup on a hot key: 0 disk reads (everything in cache). For a negative lookup: 0 disk reads (bloom filters reject all SSTables). For a cold positive lookup: 1 disk read (the data block; index and filter are pinned).
Code Examples
A Simple Bloom Filter for SSTable Key Filtering
This implements the core bloom filter operations — build during SSTable creation, query during reads.
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
/// A bloom filter for probabilistic key membership testing.
struct BloomFilter {
bits: Vec<u8>,
num_bits: usize,
num_hashes: u32,
}
impl BloomFilter {
/// Create a bloom filter sized for `num_keys` with the given
/// bits-per-key ratio. Optimal hash count is computed automatically.
fn new(num_keys: usize, bits_per_key: usize) -> Self {
let num_bits = num_keys * bits_per_key;
let num_bytes = (num_bits + 7) / 8;
// Optimal k = bits_per_key * ln(2) ≈ bits_per_key * 0.693
let num_hashes = ((bits_per_key as f64) * 0.693).ceil() as u32;
let num_hashes = num_hashes.max(1).min(30); // Clamp to [1, 30]
Self {
bits: vec![0u8; num_bytes],
num_bits,
num_hashes,
}
}
/// Add a key to the bloom filter.
fn insert(&mut self, key: &[u8]) {
for i in 0..self.num_hashes {
let bit_pos = self.hash(key, i) % self.num_bits;
self.bits[bit_pos / 8] |= 1 << (bit_pos % 8);
}
}
/// Check if a key might be in the set.
/// Returns false → definitely not in the set.
/// Returns true → possibly in the set (check the SSTable).
fn may_contain(&self, key: &[u8]) -> bool {
for i in 0..self.num_hashes {
let bit_pos = self.hash(key, i) % self.num_bits;
if self.bits[bit_pos / 8] & (1 << (bit_pos % 8)) == 0 {
return false; // Definitive: key is NOT in the set
}
}
true // All bits set — key is PROBABLY in the set
}
/// Generate the i-th hash of a key using double hashing.
/// h(i) = h1 + i * h2, where h1 and h2 are independent hashes.
fn hash(&self, key: &[u8], i: u32) -> usize {
let mut h1 = DefaultHasher::new();
key.hash(&mut h1);
let hash1 = h1.finish();
let mut h2 = DefaultHasher::new();
// Mix in a constant to get an independent second hash
(key, 0xDEADBEEFu32).hash(&mut h2);
let hash2 = h2.finish();
(hash1.wrapping_add((i as u64).wrapping_mul(hash2))) as usize
}
/// Serialize the bloom filter for storage in the SSTable meta block.
fn to_bytes(&self) -> Vec<u8> {
let mut buf = Vec::with_capacity(4 + 4 + self.bits.len());
buf.extend_from_slice(&(self.num_bits as u32).to_le_bytes());
buf.extend_from_slice(&self.num_hashes.to_le_bytes());
buf.extend_from_slice(&self.bits);
buf
}
}
The double hashing technique generates k hash values from just two base hashes: h(i) = h1 + i × h2. This is mathematically equivalent to using k independent hash functions for bloom filter purposes (proven by Kirsch and Mitzenmacher, 2006) and much cheaper to compute.
The DefaultHasher is SipHash, which is well-distributed but not the fastest. Production bloom filters use xxHash or wyhash for speed. The algorithm is the same regardless of hash function — only throughput changes.
Integrating the Bloom Filter into SSTable Reads
Modifying the LSM read path to consult bloom filters before reading SSTable blocks.
/// Check an SSTable for a key, using the bloom filter to skip if possible.
fn check_sstable(
sst: &SsTableReader,
key: &[u8],
block_cache: &mut BlockCache,
) -> io::Result<Option<Option<Vec<u8>>>> {
// Step 1: Bloom filter check (in-memory, zero I/O)
if !sst.bloom_filter.may_contain(key) {
return Ok(None); // Definitely not in this SSTable
}
// Step 2: Index block lookup (cached or pinned, usually zero I/O)
let block_handle = sst.find_block_for_key(key, block_cache)?;
// Step 3: Data block read (1 disk read if not cached)
let data_block = block_cache.get_or_load(
sst.id,
block_handle.offset,
block_handle.size,
&sst.file,
)?;
// Step 4: Search the data block for the key
match data_block.find(key) {
Some(entry) => Ok(Some(entry.value.clone())), // Found (value or tombstone)
None => Ok(None), // Key not in this block (bloom filter false positive)
}
}
When the bloom filter returns false (step 1 in the code above), the entire SSTable is skipped — no index read, no data read, no disk I/O. This is the single biggest read-path optimization in the LSM architecture.
When the bloom filter returns true but the key isn't actually in the SSTable (false positive), the engine performs an unnecessary index + data block read. At 1% FPR, this happens once per 100 negative probes per SSTable — rare enough to be negligible.
Key Takeaways
- Bloom filters eliminate 99% of unnecessary SSTable probes for negative lookups at 10 bits per key. This transforms the LSM's worst case (negative lookups checking every SSTable) into a near-zero-I/O operation.
- The false positive rate is tunable via bits-per-key: 10 bits → ~1% FPR, 14 bits → ~0.1% FPR. The OOR should use 10 bits per key as the default, matching RocksDB's default.
- Block cache stores recently-accessed SSTable blocks in memory. Pinning index and filter blocks guarantees that every SSTable lookup costs at most 1 disk read (for the data block).
- The fully optimized LSM read path: bloom filter (in-memory) → index block (pinned in cache) → data block (cached or 1 disk read). For hot keys, this is 0 disk reads — competitive with B+ tree performance.
- Prefix bloom filters extend filtering to range scans by hashing key prefixes instead of full keys. Higher false positive rate but significant I/O savings for prefix-based range queries.
Project — LSM-Backed TLE Storage Engine
Module: Database Internals — M03: LSM Trees & Compaction
Track: Orbital Object Registry
Estimated effort: 8–10 hours
- SDA Incident Report — OOR-2026-0044
- Objective
- Acceptance Criteria
- Starter Structure
- Hints
- What Comes Next
SDA Incident Report — OOR-2026-0044
Classification: ENGINEERING DIRECTIVE
Subject: Build LSM storage engine prototype for write-optimized TLE ingestion
The B+ tree index cannot sustain burst ingestion rates during fragmentation events. Build an LSM-tree-based storage engine that batches writes in a memtable, flushes to immutable SSTables, and uses leveled compaction to bound read amplification. Bloom filters on each SSTable must reduce negative lookup cost to near-zero.
Objective
Build a complete LSM storage engine that:
- Accepts put(key, value) and delete(key) into an in-memory memtable
- Freezes and flushes the memtable to SSTable files when a size threshold is reached
- Supports get(key) by probing the memtable, then SSTables from newest to oldest
- Implements a simple leveled compaction (merge all L0 into L1) triggered by SSTable count
- Attaches a bloom filter to each SSTable for fast negative lookups
- Supports scan(start_key, end_key) via a merge iterator over all sources
Acceptance Criteria
- Write throughput. Insert 100,000 TLE records with a 4MB memtable limit. Measure and print the total time and records/second. Target: >50,000 records/second (in-memory memtable inserts).
- Memtable flush. Verify that SSTables are created on disk after the memtable reaches the size threshold. Print the number of SSTables after all inserts.
- Point lookup correctness. After all inserts, look up 1,000 random NORAD IDs and verify each returns the correct TLE record. Look up 1,000 non-existent IDs and verify each returns None.
- Bloom filter effectiveness. Report bloom filter hit/miss stats: how many SSTable probes were skipped by the bloom filter during the 1,000 negative lookups. Target: >95% skip rate.
- Delete correctness. Delete 1,000 records. Verify get returns None for deleted keys. Verify that non-deleted keys adjacent to deleted keys still return correct values.
- Compaction. Trigger compaction (merge L0 SSTables into L1). Verify that the number of SSTables decreases. Verify all keys are still accessible after compaction. Verify that deleted keys (tombstones) are garbage-collected if compaction output is the bottommost level.
- Range scan. Scan NORAD IDs [40000, 40500]. Verify the results are in sorted order and include exactly the expected keys.
Starter Structure
lsm-storage/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── memtable.rs # MemTable: BTreeMap-backed sorted store
│ ├── sstable.rs # SsTableBuilder, SsTableReader
│ ├── bloom.rs # BloomFilter: insert, may_contain, serialize
│ ├── merge_iter.rs # MergeIterator: multi-way merge over sorted sources
│ ├── compaction.rs # Compaction controller and merge logic
│ └── lsm.rs # LsmEngine: top-level API (put, get, delete, scan)
Hints
Hint 1 — SSTable file naming
Name SSTable files with a monotonically increasing ID: sst_000001.dat, sst_000002.dat, etc. Higher IDs are newer. The LSM state (which SSTables are active) can be tracked in memory as a Vec<SstMeta> per level. Persist the LSM state to a manifest file for crash recovery (or defer this to Module 4).
Hint 2 — Simple L0→L1 compaction trigger
The simplest compaction trigger: when the number of L0 SSTables reaches 4, merge all L0 SSTables into a single sorted L1 SSTable. This eliminates L0's overlapping key ranges. If L1 already has SSTables, include them in the merge to maintain the non-overlapping invariant.
Hint 3 — Merge iterator design
Use a BinaryHeap<Reverse<(key, source_id, value)>> as the min-heap. The source_id breaks ties: lower source ID = newer source. When popping an entry, skip all entries with the same key from older sources (they are superseded).
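This hint can be sketched over in-memory sorted runs. Types and names here are illustrative; each source is a sorted Vec, lower source index means newer, and Option<Vec<u8>> models value-or-tombstone:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Multi-way merge over sorted sources. Lower source index = newer source;
/// it wins ties on equal keys. None models a tombstone.
fn merge(sources: &[Vec<(Vec<u8>, Option<Vec<u8>>)>]) -> Vec<(Vec<u8>, Option<Vec<u8>>)> {
    let mut heap = BinaryHeap::new();
    // Seed the min-heap with the first entry of each source.
    for (sid, src) in sources.iter().enumerate() {
        if let Some((k, _)) = src.first() {
            heap.push(Reverse((k.clone(), sid, 0usize)));
        }
    }
    let mut out: Vec<(Vec<u8>, Option<Vec<u8>>)> = Vec::new();
    while let Some(Reverse((key, sid, idx))) = heap.pop() {
        // Refill the heap from the same source.
        if let Some((k, _)) = sources[sid].get(idx + 1) {
            heap.push(Reverse((k.clone(), sid, idx + 1)));
        }
        // Equal keys pop newest-first (smaller sid), so any entry whose key
        // was already emitted is a superseded duplicate; skip it.
        if out.last().map_or(false, |(k, _)| *k == key) {
            continue;
        }
        out.push((key, sources[sid][idx].1.clone()));
    }
    out
}
```

The tie-break works because the heap orders by the full (key, source_id, index) tuple: for equal keys, the smallest source_id pops first.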
Hint 4 — Atomicity of SSTable swap
During compaction, write all output SSTables before modifying the LSM state. Then atomically update the state (swap old inputs for new outputs). If the engine crashes mid-compaction, the old SSTables are still valid — the output SSTables are orphaned files that can be cleaned up. This is the simplest crash-safe compaction strategy without a full WAL/manifest (which Module 4 adds).
What Comes Next
Module 4 (WAL & Recovery) adds durability. The memtable is volatile — if the process crashes, unflushed writes are lost. The WAL logs every write before it enters the memtable, enabling recovery. The manifest log tracks which SSTables are active, enabling crash-safe compaction.
Module 04 — Write-Ahead Logging & Recovery
Track: Database Internals — Orbital Object Registry
Position: Module 4 of 6
Source material: Database Internals — Alex Petrov, Chapters 9–10; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0045
Classification: DATA LOSS INCIDENT
Subject: 2,400 TLE records lost after unplanned power failure
At 03:17 UTC, a PDU failure at Ground Station Bravo caused an unclean shutdown of the OOR storage engine. The active memtable contained approximately 2,400 TLE updates from the preceding 12-minute pass window. Because the memtable is a volatile in-memory structure, all 2,400 records were lost. The LSM engine restarted with only the previously flushed SSTables, leaving the catalog 12 minutes stale. Two conjunction alerts were delayed because the missing TLEs contained the most recent orbital elements for objects in a close-approach trajectory.
Directive: Implement a write-ahead log. Every mutation must be logged to durable storage before it is applied to the memtable. On crash recovery, replay the WAL to reconstruct the memtable to its pre-crash state.
Learning Outcomes
After completing this module, you will be able to:
- Explain the write-ahead rule and why it is the foundation of crash recovery
- Implement a WAL that logs key-value operations to an append-only file with checksummed records
- Describe the ARIES recovery protocol — analysis, redo, and undo phases
- Implement crash recovery by replaying the WAL to reconstruct the memtable
- Design a checkpointing strategy that limits WAL replay time after a crash
- Reason about the tradeoff between fsync frequency and durability guarantees
Lesson Summary
Lesson 1 — WAL Fundamentals
The write-ahead rule, log record format, LSN ordering, and the WAL's role in the LSM write path. Why fsync is the only guarantee of durability, and the latency cost of calling it.
Key question: What is the maximum data loss window under group commit with 50ms batch intervals?
Lesson 2 — Crash Recovery
The ARIES recovery protocol adapted for the LSM engine. Analysis phase (determine which WAL records need replay), redo phase (replay committed operations into the memtable), and how the manifest tracks SSTable state for consistent restart.
Key question: If the engine crashes during a compaction, does it use the old or new SSTables on recovery?
Lesson 3 — Checkpointing
Fuzzy checkpoints that snapshot the LSM state without blocking writes. How checkpoints bound WAL replay time and enable WAL truncation. The tradeoff between checkpoint frequency and recovery time.
Key question: What is the maximum WAL replay time for the OOR with 60-second checkpoint intervals?
Capstone Project — Durable TLE Update Pipeline
Add WAL-based durability to the Module 3 LSM engine. Every write is logged before entering the memtable. On simulated crash, the engine recovers to a consistent state by replaying the WAL. Full brief in project-durable-pipeline.md.
File Index
module-04-wal-recovery/
├── README.md
├── lesson-01-wal-fundamentals.md
├── lesson-01-quiz.toml
├── lesson-02-crash-recovery.md
├── lesson-02-quiz.toml
├── lesson-03-checkpointing.md
├── lesson-03-quiz.toml
└── project-durable-pipeline.md
Prerequisites
- Module 3 (LSM Trees & Compaction) completed
What Comes Next
Module 5 (Transactions & Isolation) adds concurrent read/write support with MVCC snapshot isolation, building on the durable foundation established here.
Lesson 1 — WAL Fundamentals
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapters 9–10; Mini-LSM Week 2 Day 6
Source note: This lesson was synthesized from training knowledge. Verify Petrov's WAL record format, LSN semantics, and fsync discussion against Chapters 9–10.
Context
The LSM engine from Module 3 achieves high write throughput by buffering writes in a memtable and flushing them to SSTables in the background. But the memtable is volatile — it lives in process memory. If the process crashes, the OS kills the process, or power fails, the memtable's contents are lost. For the OOR, this means losing every TLE update since the last flush — potentially minutes of orbital data during an active pass window.
The Write-Ahead Log (WAL) is the solution: an append-only file on durable storage that records every mutation before it is applied to the memtable. The key insight is the write-ahead rule: no modification to the in-memory state is visible until its corresponding log record has been durably written to the WAL. If the engine crashes after logging but before flushing, the WAL contains enough information to reconstruct the memtable by replaying the logged operations.
This changes the durability guarantee from "data is safe after SSTable flush" (every 5–60 seconds) to "data is safe after WAL write" (every operation, or every batch). The cost is one sequential disk write per operation (or per batch) — but sequential appends are cheap, especially on SSDs.
Core Concepts
The Write-Ahead Rule
The rule is simple and inviolable: log first, then mutate. The LSM write path becomes:
1. Serialize the operation (put or delete) into a WAL log record.
2. Append the record to the WAL file.
3. Call fsync on the WAL file (or batch fsync for a group of records).
4. Apply the operation to the memtable.
5. Return success to the caller.
If the engine crashes between steps 2 and 4, the WAL contains the operation but the memtable does not — recovery will replay it. If the engine crashes before step 2, the operation was never logged — the caller did not receive a success response, so the operation is not considered committed.
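The ordering can be seen in a toy sketch, where a Vec stands in for the durable log and all names are illustrative:

```rust
use std::collections::BTreeMap;

/// Toy engine illustrating the write-ahead rule: the mutation reaches the
/// in-memory table only after it has been appended to the log.
struct TinyEngine {
    wal: Vec<(Vec<u8>, Vec<u8>)>,      // stands in for the durable WAL file
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl TinyEngine {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        // Steps 1-3: log first (append + fsync would happen here).
        self.wal.push((key.to_vec(), value.to_vec()));
        // Step 4: only then mutate the in-memory state.
        self.memtable.insert(key.to_vec(), value.to_vec());
    }

    /// Recovery sketch: rebuild the memtable from the log alone. Replaying
    /// in order means later entries for the same key win.
    fn recover(wal: &[(Vec<u8>, Vec<u8>)]) -> BTreeMap<Vec<u8>, Vec<u8>> {
        wal.iter().cloned().collect()
    }
}
```

Dropping the memtable and calling `recover` on the surviving log reproduces the exact pre-crash state, which is the whole point of the rule.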
WAL Record Format
Each WAL record is a self-contained, checksummed unit:
┌──────────────────────────────────────────┐
│ Record Header │
│ - LSN: u64 (log sequence number) │
│ - record_type: u8 (Put=1, Delete=2) │
│ - key_len: u16 │
│ - value_len: u16 │
│ - checksum: u32 (CRC32 of key+value) │
├──────────────────────────────────────────┤
│ Key bytes (key_len bytes) │
├──────────────────────────────────────────┤
│ Value bytes (value_len bytes, if Put) │
└──────────────────────────────────────────┘
The Log Sequence Number (LSN) is a monotonically increasing identifier assigned to each record. LSNs establish a total order over all operations — recovery replays records in LSN order to reconstruct the exact pre-crash state. The LSN also correlates WAL records with SSTable flushes: when the memtable is flushed, the engine records the highest LSN contained in that flush. On recovery, only WAL records with LSNs greater than the last flushed LSN need to be replayed.
The checksum protects against partial writes: if the engine crashes mid-write, the incomplete record will have an invalid checksum and is discarded during recovery.
fsync Strategies
fsync is expensive: 0.1–1ms on SSD, 5–30ms on spinning disk. Three strategies for trading durability against throughput:
Per-operation fsync: Maximum durability — at most one operation can be lost. Throughput limited by fsync latency (~1,000–10,000 ops/sec on SSD).
Group commit (batch fsync): Buffer multiple WAL writes, then fsync the batch. If the batch covers 100 operations and fsync takes 0.5ms, the amortized cost is 5µs per operation. Standard approach — used by RocksDB, PostgreSQL, MySQL.
No fsync: Maximum throughput, minimum durability — a crash can lose up to 30 seconds of data. Acceptable for caches, not for the OOR.
For the OOR, group commit is the correct choice: batch TLE updates per pass window, fsync once per batch.
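A sketch of the batching pattern. The record framing here is a simple length prefix, purely illustrative:

```rust
use std::fs::File;
use std::io::{self, Write};

/// Group commit sketch: serialize a whole batch of records, write them with
/// one write_all call, and pay for a single fsync for the entire batch.
fn group_commit(file: &mut File, batch: &[Vec<u8>]) -> io::Result<()> {
    let mut buf = Vec::new();
    for rec in batch {
        // Length-prefix each record so replay can split the batch apart.
        buf.extend_from_slice(&(rec.len() as u32).to_le_bytes());
        buf.extend_from_slice(rec);
    }
    file.write_all(&buf)?;
    file.sync_all() // one fsync amortized over the whole batch
}
```

With a 0.5ms fsync and 100 records per batch, each record pays roughly 5µs of sync cost, matching the arithmetic above.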
WAL in the LSM Write Path
put(key, value)
│
▼
WAL append (serialize + write + optional fsync)
│
▼
Memtable insert (in-memory, fast)
│
▼
Return success
When the memtable is flushed to an SSTable, the engine records the flushed LSN. WAL records at or below this LSN are no longer needed for recovery, enabling WAL truncation.
Code Examples
WAL Writer: Appending Checksummed Records
use std::io::{self, Write, BufWriter};
use std::fs::{File, OpenOptions};
#[repr(u8)]
#[derive(Clone, Copy)]
enum WalRecordType {
Put = 1,
Delete = 2,
}
struct WalWriter {
writer: BufWriter<File>,
next_lsn: u64,
}
impl WalWriter {
fn open(path: &str) -> io::Result<Self> {
let file = OpenOptions::new()
.create(true)
.append(true)
.open(path)?;
Ok(Self {
writer: BufWriter::new(file),
next_lsn: 0,
})
}
fn log_put(&mut self, key: &[u8], value: &[u8]) -> io::Result<u64> {
self.write_record(WalRecordType::Put, key, Some(value))
}
fn log_delete(&mut self, key: &[u8]) -> io::Result<u64> {
self.write_record(WalRecordType::Delete, key, None)
}
fn write_record(
&mut self,
rec_type: WalRecordType,
key: &[u8],
value: Option<&[u8]>,
) -> io::Result<u64> {
let lsn = self.next_lsn;
self.next_lsn += 1;
let val = value.unwrap_or(&[]);
// Checksum covers key + value
let mut hasher = crc32fast::Hasher::new();
hasher.update(key);
hasher.update(val);
let checksum = hasher.finalize();
// Header: LSN(8) + type(1) + key_len(2) + val_len(2) + checksum(4) = 17 bytes
self.writer.write_all(&lsn.to_le_bytes())?;
self.writer.write_all(&[rec_type as u8])?;
self.writer.write_all(&(key.len() as u16).to_le_bytes())?;
self.writer.write_all(&(val.len() as u16).to_le_bytes())?;
self.writer.write_all(&checksum.to_le_bytes())?;
self.writer.write_all(key)?;
self.writer.write_all(val)?;
Ok(lsn)
}
/// Flush to durable storage. Call after a batch for group commit.
fn sync(&mut self) -> io::Result<()> {
self.writer.flush()?;
self.writer.get_ref().sync_all()
}
}
WAL Reader: Replaying Records for Recovery
struct WalRecord {
lsn: u64,
rec_type: WalRecordType,
key: Vec<u8>,
value: Vec<u8>,
}
use std::io::Read; // read_exact is provided by the Read trait
struct WalReader {
    reader: std::io::BufReader<File>,
}
impl WalReader {
    /// Open an existing WAL file for replay.
    fn open(path: &str) -> io::Result<Self> {
        Ok(Self { reader: std::io::BufReader::new(File::open(path)?) })
    }
fn next_record(&mut self) -> io::Result<Option<WalRecord>> {
let mut header = [0u8; 17];
match self.reader.read_exact(&mut header) {
Ok(()) => {}
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e),
}
let lsn = u64::from_le_bytes(header[0..8].try_into().unwrap());
let rec_type = match header[8] {
1 => WalRecordType::Put,
2 => WalRecordType::Delete,
_ => return Err(io::Error::new(
io::ErrorKind::InvalidData, "invalid WAL record type",
)),
};
let key_len = u16::from_le_bytes(header[9..11].try_into().unwrap()) as usize;
let val_len = u16::from_le_bytes(header[11..13].try_into().unwrap()) as usize;
let stored_checksum = u32::from_le_bytes(header[13..17].try_into().unwrap());
let mut key = vec![0u8; key_len];
let mut value = vec![0u8; val_len];
self.reader.read_exact(&mut key)?;
self.reader.read_exact(&mut value)?;
// Verify checksum — detects partial writes from crashes
let mut hasher = crc32fast::Hasher::new();
hasher.update(&key);
hasher.update(&value);
if hasher.finalize() != stored_checksum {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("WAL checksum mismatch at LSN {} — partial write detected", lsn),
));
}
Ok(Some(WalRecord { lsn, rec_type, key, value }))
}
}
The checksum verification is the corruption boundary: a failed checksum means the engine crashed mid-write. All prior records are valid; this record and everything after it are discarded.
Key Takeaways
- The write-ahead rule is absolute: log the operation to durable storage before applying it to the memtable. This guarantees that committed operations survive crashes.
- fsync is the durability boundary. Group commit amortizes the cost across many operations — the standard approach for production engines.
- Each WAL record is self-contained with a CRC32 checksum. Partial writes are detected and discarded during recovery.
- The LSN orders all operations and correlates WAL records with SSTable flushes. WAL records at or below the flushed LSN are safe to truncate.
- WAL writes are sequential appends — the cheapest form of disk I/O.
Lesson 2 — Crash Recovery
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 10
Source note: This lesson was synthesized from training knowledge. Verify Petrov's ARIES protocol adaptation for LSM engines against Chapter 10.
Context
The WAL ensures every committed operation is recorded on durable storage. After a crash, the engine must use that log to return to a consistent state. This is the job of the crash recovery protocol — a deterministic procedure that reads the WAL, determines what was lost, and reconstructs the in-memory state.
For the OOR's LSM engine, recovery is simpler than for a traditional B-tree database because SSTables are immutable. There are no dirty pages to redo or uncommitted transactions to undo — the only volatile state is the memtable. Recovery reconstructs the memtable by replaying WAL records that were not yet flushed to an SSTable.
The classical protocol is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). ARIES has three phases — analysis, redo, and undo. For the LSM engine, we adapt it: the analysis phase determines the recovery starting point, the redo phase replays the WAL into the memtable, and the undo phase is unnecessary (no uncommitted transactions to roll back at this stage).
Core Concepts
The Manifest
The manifest is a metadata file that records the LSM engine's durable state: which SSTables are active, what level each belongs to, and the highest flushed LSN. On recovery, the manifest is the starting point.
The manifest is itself an append-only log. Each entry records a state change:
[1] AddSSTable { id: 1, level: 0, min_key: "a", max_key: "z", flushed_lsn: 500 }
[2] AddSSTable { id: 2, level: 0, min_key: "b", max_key: "y", flushed_lsn: 1000 }
[3] Compaction { removed: [1, 2], added: [3], output_level: 1, flushed_lsn: 1000 }
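Replaying the manifest log above reconstructs the durable state. A simplified in-memory sketch (the entry shapes are illustrative; min/max keys are omitted):

```rust
/// Simplified manifest entries, mirroring the log shown above.
enum ManifestEntry {
    AddSsTable { id: u64, level: usize, flushed_lsn: u64 },
    Compaction { removed: Vec<u64>, added: Vec<u64>, output_level: usize, flushed_lsn: u64 },
}

/// Replay manifest entries in order to reconstruct the active SSTable set
/// (as (sst_id, level) pairs) and the highest flushed LSN.
fn replay(entries: &[ManifestEntry]) -> (Vec<(u64, usize)>, u64) {
    let mut active: Vec<(u64, usize)> = Vec::new();
    let mut flushed_lsn = 0u64;
    for e in entries {
        match e {
            ManifestEntry::AddSsTable { id, level, flushed_lsn: lsn } => {
                active.push((*id, *level));
                flushed_lsn = flushed_lsn.max(*lsn);
            }
            ManifestEntry::Compaction { removed, added, output_level, flushed_lsn: lsn } => {
                // Compaction atomically swaps inputs for outputs.
                active.retain(|(id, _)| !removed.contains(id));
                for id in added {
                    active.push((*id, *output_level));
                }
                flushed_lsn = flushed_lsn.max(*lsn);
            }
        }
    }
    (active, flushed_lsn)
}
```

Replaying the three entries above yields a single active SSTable (id 3 at level 1) and a flushed LSN of 1000.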
LSM Recovery Protocol
- Read the manifest. Reconstruct the active SSTable set and determine the highest flushed LSN.
- Open the WAL. Seek to the first record with LSN > flushed_lsn.
- Replay WAL records. For each valid record, apply it to a fresh memtable:
  - Put → insert the key-value pair
  - Delete → insert a tombstone
- Stop at corruption. If a record's checksum fails, stop replaying. That record and everything after it are discarded.
- Resume normal operation. The memtable now contains all committed-but-unflushed operations.
Recovery timeline:
[SSTables on disk] [WAL on disk]
├─ flushed to LSN 1000 ─┤─ LSN 1001..1234 valid ─┤─ LSN 1235 corrupt ─┤
│ │
Replay into memtable Stop here
Crash During Compaction
If the engine crashes mid-compaction:
- Before manifest update: Old SSTables are still listed as active. Partially-written new SSTables are orphaned files. Recovery uses the old SSTables. Orphans are cleaned up.
- After manifest update: New SSTables are active. Old SSTables are marked for deletion. Recovery uses the new SSTables.
The manifest update is the atomicity boundary. The pattern: write data first, then atomically update metadata.
Crash During Flush
If the engine crashes mid-flush:
- Before manifest update: The partial SSTable is orphaned. The WAL still contains all records. Recovery replays the WAL.
- After manifest update: The SSTable is complete and active. WAL records up to the flushed LSN are redundant.
Orphan Cleanup
On startup, the engine scans the data directory for SSTable files not referenced by the manifest. These are orphans from interrupted compactions or flushes. They are deleted before normal operation begins.
Code Examples
LSM Engine Recovery
impl LsmEngine {
fn recover(db_path: &str) -> io::Result<Self> {
// Phase 1: Read manifest
let (sst_state, flushed_lsn) = ManifestReader::replay(
&format!("{}/MANIFEST", db_path)
)?;
eprintln!("Recovery: {} SSTables, flushed LSN = {}", sst_state.total_count(), flushed_lsn);
// Phase 2: Replay WAL
let mut memtable = MemTable::new(16 * 1024 * 1024);
let mut replayed = 0u64;
let mut next_lsn = flushed_lsn + 1;
if let Ok(mut reader) = WalReader::open(&format!("{}/WAL", db_path)) {
loop {
match reader.next_record() {
Ok(Some(record)) => {
if record.lsn <= flushed_lsn { continue; }
match record.rec_type {
WalRecordType::Put => memtable.put(record.key, record.value),
WalRecordType::Delete => memtable.delete(record.key),
}
next_lsn = record.lsn + 1;
replayed += 1;
}
Ok(None) => break,
Err(e) => {
eprintln!("Recovery: corrupt record ({}), {} replayed", e, replayed);
break;
}
}
}
}
eprintln!("Recovery: replayed {} WAL records", replayed);
// Phase 3: Clean up orphaned SSTables
cleanup_orphans(db_path, &sst_state)?;
// Phase 4: Open fresh WAL and manifest for new writes
let wal = WalWriter::open(&format!("{}/WAL", db_path))?;
let manifest = ManifestWriter::open(&format!("{}/MANIFEST", db_path))?;
Ok(Self { active_memtable: memtable, immutable_memtables: Vec::new(),
sst_state, wal, manifest, next_lsn })
}
}
fn cleanup_orphans(db_path: &str, sst_state: &LsmState) -> io::Result<()> {
let active_ids: std::collections::HashSet<u64> = sst_state.all_sst_ids().collect();
for entry in std::fs::read_dir(db_path)? {
let entry = entry?;
let name = entry.file_name().to_string_lossy().to_string();
if name.starts_with("sst_") && name.ends_with(".dat") {
let id: u64 = name[4..name.len()-4].parse().unwrap_or(u64::MAX);
if !active_ids.contains(&id) {
eprintln!("Recovery: deleting orphaned SSTable {}", name);
std::fs::remove_file(entry.path())?;
}
}
}
Ok(())
}
The recovery procedure is deterministic: given the same manifest and WAL files, it always produces the same memtable state.
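That determinism can be exercised directly with a pure model of the replay loop. The WalRecord fields here are simplified assumptions, not the engine's actual record layout:

```rust
use std::collections::BTreeMap;

struct WalRecord {
    lsn: u64,
    key: String,
    value: Option<String>, // None models a Delete tombstone
    checksum_ok: bool,
}

/// Deterministic replay: apply records with lsn > flushed_lsn until the
/// first checksum failure, exactly like the recovery loop above.
fn replay(records: &[WalRecord], flushed_lsn: u64) -> BTreeMap<String, Option<String>> {
    let mut memtable = BTreeMap::new();
    for r in records {
        if !r.checksum_ok { break; }          // stop at corruption
        if r.lsn <= flushed_lsn { continue; } // already durable in an SSTable
        memtable.insert(r.key.clone(), r.value.clone());
    }
    memtable
}
```

Running `replay` twice over the same records must produce identical memtables, and a corrupt record cuts off everything after it.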
Key Takeaways
- LSM crash recovery is simpler than B-tree recovery because SSTables are immutable. The only volatile state to reconstruct is the memtable.
- The manifest tracks active SSTables and the flushed LSN. It is the recovery starting point and the atomicity boundary for compaction and flush operations.
- Recovery replays WAL records with LSN > flushed_lsn into a fresh memtable. Partial writes are detected by checksum failure and discarded.
- The pattern "write data, then atomically update metadata" applies to both flushes and compactions. The manifest update is the commit point.
- Orphan cleanup on startup removes SSTable files from interrupted operations that were never recorded in the manifest.
Lesson 3 — Checkpointing
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 10
Source note: This lesson was synthesized from training knowledge. Verify Petrov's fuzzy checkpoint algorithm and his WAL truncation semantics against Chapter 10.
Context
Without checkpointing, the WAL grows indefinitely. If the engine has been running for 24 hours with 100,000 TLE updates, the WAL contains 100,000 records — all of which must be scanned during recovery to find those with LSN > flushed_lsn. Recovery time grows linearly with WAL size. For a system that must return to service within seconds after a crash (the OOR's conjunction avoidance SLA requires <30s recovery), unbounded WAL growth is unacceptable.
A checkpoint snapshots the current LSM state and records the position where recovery should start. After a checkpoint, WAL records before the checkpoint position can be safely deleted, bounding both WAL size and recovery time.
Core Concepts
What a Checkpoint Records
A checkpoint writes the following to the manifest:
- Checkpoint LSN — the highest LSN that is fully durable (either in an SSTable or committed to the WAL and fsync'd at checkpoint time).
- Active SSTable list — every SSTable currently in the LSM state, with level assignments.
- Memtable status — the LSN range of the current active memtable (not yet flushed).
After the checkpoint, the WAL can be truncated up to the minimum recovery LSN: the smallest LSN that might still need replay. This is the lower bound of the active memtable's LSN range at checkpoint time.
Fuzzy Checkpoints
A sharp checkpoint freezes all writes, flushes the memtable, records the state, and then resumes. This guarantees that the checkpoint LSN is fully consistent — but it blocks writes for the duration of the flush (potentially seconds).
A fuzzy checkpoint avoids blocking writes:
- Record the current memtable's LSN range and the active SSTable list.
- Write this snapshot to the manifest.
- Continue accepting writes — the memtable keeps growing.
The tradeoff: recovery after a fuzzy checkpoint must replay WAL records from the memtable's start LSN (not the checkpoint LSN), because the memtable was not flushed at checkpoint time. Fuzzy checkpoints are faster (no flush) but result in slightly longer recovery (more WAL records to replay).
In the LSM architecture, fuzzy checkpoints are natural: the memtable flush is a form of checkpointing. Every time a memtable is flushed to an SSTable, the flushed LSN advances, and older WAL records become eligible for truncation. Explicit checkpoints are only needed if flush intervals are very long.
WAL Truncation
After a checkpoint (or flush), WAL records below the minimum recovery LSN can be deleted. Two strategies:
Segment-based: The WAL is split into fixed-size segments (e.g., 64MB files). A segment can be deleted when all its records have LSN ≤ the minimum recovery LSN. Simple and efficient — the filesystem handles cleanup.
Single-file with logical truncation: The WAL is one file. A "truncation point" is maintained in the manifest. On recovery, records before this point are skipped. The file is physically truncated (or rewritten) during periodic maintenance.
Segment-based is the standard approach (used by RocksDB, Kafka, PostgreSQL's WAL segments). It avoids the complexity of in-place truncation and enables simple space reclamation.
Recovery Time Analysis
Recovery time = time to read manifest + time to replay WAL records.
Manifest replay is fast (typically <100 entries). WAL replay dominates: each record requires deserialization and a memtable insert. At ~1µs per memtable insert and 100,000 records to replay, recovery takes ~100ms for the replay phase.
Checkpointing bounds this: with checkpoints every 60 seconds and 3,000 writes/sec, the maximum WAL replay is ~180,000 records = ~180ms. Well within the OOR's 30-second recovery SLA.
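The arithmetic above can be captured in a one-line helper. The 1µs-per-insert figure is the lesson's estimate, not a measured constant:

```rust
/// Back-of-envelope recovery bound: the records to replay are at most
/// write_rate × checkpoint_interval, at roughly micros_per_insert each.
fn max_replay_micros(writes_per_sec: u64, checkpoint_interval_secs: u64, micros_per_insert: u64) -> u64 {
    writes_per_sec * checkpoint_interval_secs * micros_per_insert
}
```

At 3,000 writes/sec and 60-second checkpoints, this yields 180,000µs, i.e. the ~180ms worst case quoted above.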
Coordinating Checkpoints with Compaction
Checkpoints and compaction both modify the manifest. To prevent conflicts:
- Acquire a manifest lock before writing a checkpoint or compaction result.
- Write the manifest entry.
- Release the lock.
The lock is held briefly (one file write + fsync), so contention is low. The manifest itself is append-only, so there are no conflicting edits — only the ordering of entries matters.
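A sketch of that locking discipline, with an in-memory stand-in for the manifest writer (the types are illustrative, not the engine's):

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the append-only manifest.
struct Manifest {
    entries: Vec<String>,
}

/// Checkpoint and compaction writers both go through this function:
/// the lock is held for exactly one append, so entries never interleave
/// and only their ordering can vary.
fn append_entry(manifest: &Arc<Mutex<Manifest>>, entry: &str) {
    let mut m = manifest.lock().unwrap();
    m.entries.push(entry.to_string()); // in the real engine: one write + fsync
}
```

Because every writer serializes on the same mutex for a single short append, contention stays low even with checkpoints and compactions running concurrently.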
Code Examples
Checkpoint Implementation
impl LsmEngine {
/// Write a fuzzy checkpoint to the manifest.
/// This records the current LSM state without blocking writes.
fn checkpoint(&mut self) -> io::Result<()> {
// Snapshot the current state
let active_ssts: Vec<SstMeta> = self.sst_state.all_sstables().cloned().collect();
let memtable_min_lsn = self.active_memtable_min_lsn();
let checkpoint_lsn = self.next_lsn - 1;
// Write checkpoint to manifest
self.manifest.write_checkpoint(CheckpointEntry {
checkpoint_lsn,
memtable_min_lsn,
active_sstables: active_ssts.iter().map(|s| s.id).collect(),
})?;
self.manifest.sync()?;
eprintln!(
"Checkpoint at LSN {}: {} active SSTables, \
WAL replay starts at LSN {}",
checkpoint_lsn, active_ssts.len(), memtable_min_lsn,
);
// Truncate WAL segments that are fully below memtable_min_lsn
self.wal.truncate_before(memtable_min_lsn)?;
Ok(())
}
fn active_memtable_min_lsn(&self) -> u64 {
// The earliest LSN in the active memtable is the minimum
// recovery point — WAL records before this are redundant.
// If the memtable is empty, use the flushed LSN.
self.active_memtable
.min_lsn()
.unwrap_or(self.sst_state.flushed_lsn())
}
}
The checkpoint does not flush the memtable — it records where the memtable starts (min LSN) so recovery knows where to begin WAL replay. This is the "fuzzy" part: writes continue during and after the checkpoint, but the WAL truncation point is safely advanced.
WAL Segment Manager
/// Manages WAL as a series of fixed-size segments for clean truncation.
struct WalSegmentManager {
dir: String,
segment_size: usize,
active_segment: WalWriter,
active_segment_id: u64,
}
impl WalSegmentManager {
/// Truncate (delete) all WAL segments whose max LSN is below the given LSN.
fn truncate_before(&mut self, min_recovery_lsn: u64) -> io::Result<()> {
let entries = std::fs::read_dir(&self.dir)?;
for entry in entries {
let entry = entry?;
let name = entry.file_name().to_string_lossy().to_string();
if name.starts_with("wal_") && name.ends_with(".log") {
// Parse segment ID from filename: wal_000042.log → 42
let seg_id = name[4..10].parse::<u64>().unwrap_or(u64::MAX);
// Each segment covers a known LSN range.
// Conservative: only delete if segment_max_lsn < min_recovery_lsn
if self.segment_max_lsn(seg_id) < min_recovery_lsn {
std::fs::remove_file(entry.path())?;
eprintln!("WAL: deleted segment {}", name);
}
}
}
Ok(())
}
fn segment_max_lsn(&self, segment_id: u64) -> u64 {
// In practice, track this in memory or in the segment header.
// Simplified: assume segments hold a known max number of records.
(segment_id + 1) * (self.segment_size as u64 / 103) // ~records per segment
}
}
Segment-based truncation is simple: delete files whose records are all below the recovery starting point. No in-place file modification, no complex bookkeeping. The filesystem handles space reclamation.
Key Takeaways
- Checkpoints bound WAL size and recovery time by recording a recovery starting point. Without checkpoints, the WAL grows indefinitely and recovery scans the entire log.
- Fuzzy checkpoints avoid blocking writes — they snapshot the LSM state without flushing the memtable. The tradeoff is slightly longer recovery (WAL replay from the memtable's start LSN, not the checkpoint LSN).
- In an LSM engine, every memtable flush is implicitly a checkpoint — it advances the flushed LSN and makes earlier WAL records eligible for truncation.
- WAL segments (fixed-size log files) enable clean truncation by deleting entire segment files. This is simpler and more efficient than truncating a single growing file.
- Recovery time for the OOR: ~180ms worst case with 60-second checkpoint intervals and 3,000 writes/sec. Well within the 30-second conjunction avoidance SLA.
Project — Durable TLE Update Pipeline
Module: Database Internals — M04: Write-Ahead Logging & Recovery
Track: Orbital Object Registry
Estimated effort: 6–8 hours
SDA Incident Report — OOR-2026-0045
Classification: ENGINEERING DIRECTIVE
Subject: Add WAL-based durability to the LSM storage engine
Ref: OOR-2026-0045 (data loss incident after PDU failure)
Integrate a write-ahead log into the Module 3 LSM engine. Every mutation must be logged before it enters the memtable. The engine must recover to a consistent state after simulated crashes.
Acceptance Criteria
1. WAL write path. Every put and delete call appends a checksummed record to the WAL before modifying the memtable. Verify by inspecting the WAL file after 1,000 inserts.
2. Clean recovery. Insert 5,000 records, gracefully shut down, then recover. All 5,000 records must be accessible after recovery.
3. Crash recovery. Insert 5,000 records. Simulate a crash by calling std::process::abort() (or simply skipping the shutdown routine). Restart and recover. Records up to the last fsync'd batch must be accessible. Report how many records were recovered vs. the expected count.
4. Crash during flush. Insert records until a memtable flush is triggered. Simulate a crash mid-flush (after writing the SSTable but before updating the manifest). Recover and verify all data is intact — the orphaned SSTable is ignored, and the WAL is replayed to reconstruct the memtable.
5. WAL truncation. After recovery, trigger a flush and checkpoint. Verify the WAL is truncated — old segments are deleted, and the remaining WAL contains only records above the flushed LSN.
6. Recovery time. Measure recovery time for WAL sizes of 10,000, 50,000, and 100,000 records. Report the time for each. Target: recovery < 500ms for 100,000 records.
7. Manifest correctness. After multiple flush/compaction/checkpoint cycles, recover the engine and verify the manifest correctly reports the active SSTable set and flushed LSN.
Starter Structure
durable-pipeline/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: runs acceptance criteria
│ ├── wal.rs # WalWriter, WalReader, WalSegmentManager
│ ├── manifest.rs # ManifestWriter, ManifestReader, checkpoint
│ ├── lsm.rs # LsmEngine with WAL integration and recovery
│ ├── memtable.rs # Reuse from Module 3
│ ├── sstable.rs # Reuse from Module 3
│ ├── bloom.rs # Reuse from Module 3
│ └── compaction.rs # Reuse from Module 3
Hints
Hint 1 — Simulating a crash
The simplest crash simulation: after writing N records, drop the LsmEngine without calling any shutdown method, then call LsmEngine::recover() to construct a fresh engine. Alternatively, write to a temporary directory, copy/rename files to simulate partial state, and then recover from the copy.
Hint 2 — Manifest format
Keep the manifest simple: a sequence of newline-delimited JSON records. Each record is either {"type": "add_sst", "id": 42, "level": 1, "flushed_lsn": 5000} or {"type": "remove_sst", "ids": [31, 32, 33]} or {"type": "checkpoint", "lsn": 10000, "active_ssts": [42, 43]}. Parse with serde_json or manual string parsing.
Hint 3 — Crash-during-flush simulation
Write the SSTable file, then abort before writing the manifest entry. On recovery, the manifest doesn't list the SSTable. Scan the data directory for SSTable files not in the manifest and delete them (orphan cleanup). Replay the WAL to reconstruct the memtable.
What Comes Next
Module 5 (Transactions & Isolation) adds MVCC support — concurrent readers see consistent snapshots while writers continue modifying the database. The WAL and manifest from this module provide the durability foundation that MVCC transactions depend on.
Module 05 — Transactions & Isolation
Track: Database Internals — Orbital Object Registry
Position: Module 5 of 6
Source material: Database Internals — Alex Petrov, Chapters 12–13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0046
Classification: DATA ANOMALY
Subject: Conjunction query returned stale TLE data during concurrent catalog update
At 14:22 UTC, a conjunction assessment for NORAD 43013 used TLE epoch 2026-084.2 while a concurrent bulk update was writing epoch 2026-084.7 for the same object. The assessment computed a miss distance of 2.3km using the stale epoch. The updated epoch would have yielded 0.8km — below the avoidance maneuver threshold. The conjunction alert was delayed by 4 minutes until the next assessment cycle picked up the updated TLE.
Root cause: The LSM engine provides no isolation between concurrent readers and writers. A long-running conjunction query can read a mix of old and new TLE versions, producing inconsistent results.
Directive: Implement multi-version concurrency control (MVCC) with snapshot isolation. Every conjunction query must see a consistent snapshot of the catalog — either entirely before or entirely after any concurrent update.
Learning Outcomes
After completing this module, you will be able to:
- Explain the ACID properties and which guarantees are provided by each isolation level
- Implement two-phase locking (2PL) and explain why it prevents all anomalies but limits concurrency
- Implement MVCC snapshot isolation in an LSM engine using timestamped keys
- Explain write skew and why snapshot isolation does not prevent it
- Design a garbage collection strategy for old MVCC versions
- Reason about the tradeoff between isolation level and concurrent throughput
Lesson Summary
Lesson 1 — ACID Properties and Isolation Levels
What Atomicity, Consistency, Isolation, and Durability mean concretely. The isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable) and which anomalies each prevents.
Lesson 2 — Two-Phase Locking (2PL)
Lock-based concurrency control. Shared and exclusive locks, the growing and shrinking phases, strict 2PL, and deadlock detection. Why 2PL is correct but limits throughput.
Lesson 3 — MVCC and Snapshot Isolation
Multi-version concurrency control: storing multiple versions of each key with timestamps. Snapshot reads, write conflicts, and garbage collection. Adapted for the LSM architecture using timestamped keys (following Mini-LSM Week 3).
Capstone Project — Conjunction Query Engine with MVCC Snapshot Reads
Add MVCC snapshot isolation to the LSM engine. Concurrent conjunction queries see consistent catalog snapshots. Concurrent writers do not block readers. Full brief in project-conjunction-engine.md.
File Index
module-05-transactions-isolation/
├── README.md
├── lesson-01-acid-isolation.md
├── lesson-01-quiz.toml
├── lesson-02-two-phase-locking.md
├── lesson-02-quiz.toml
├── lesson-03-mvcc-snapshots.md
├── lesson-03-quiz.toml
└── project-conjunction-engine.md
Prerequisites
- Module 4 (WAL & Recovery) completed
What Comes Next
Module 6 (Query Processing) adds structured query execution on top of the transactional storage engine — the volcano iterator model, vectorized execution, and join algorithms.
Lesson 1 — ACID Properties and Isolation Levels
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 1 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's isolation level taxonomy and anomaly definitions against Chapter 7.
Context
The OOR's LSM engine from Modules 3–4 provides durable, crash-recoverable storage for TLE records. But it offers no guarantees about what happens when multiple operations execute concurrently. A conjunction query reading NORAD 43013's TLE while a bulk update is overwriting it can see a partially-updated record — or a mix of old and new versions across different objects. The result is a phantom: a conjunction assessment computed against a catalog state that never actually existed.
Transactions are the abstraction that prevents this. A transaction groups multiple operations into a single atomic, isolated unit. The ACID properties define what "correct" means for transactions, and the isolation level determines how strictly concurrent transactions are separated.
Core Concepts
ACID Properties
Atomicity: All operations in a transaction succeed or all fail. If a bulk TLE update covers 500 objects and fails on object 347, the first 346 updates are rolled back. The catalog is never left in a partially-updated state.
Consistency: The database moves from one valid state to another. Application-level invariants (e.g., every NORAD ID is unique, every TLE has a valid epoch) are preserved across transactions. Consistency is primarily the application's responsibility — the database enforces it through constraints.
Isolation: Concurrent transactions appear to execute serially. A conjunction query running alongside a bulk update sees either the entirely pre-update or entirely post-update catalog — never a mix. The isolation level determines how strictly this is enforced.
Durability: Once a transaction commits, its effects survive crashes. This is the WAL's job (Module 4).
Isolation Levels and Anomalies
Each isolation level prevents a specific set of anomalies — situations where concurrent execution produces results that no serial execution could produce.
Read Uncommitted: No isolation. A transaction can read another transaction's uncommitted writes. Vulnerable to dirty reads (reading data that may be rolled back).
Read Committed: A transaction only sees committed data. Prevents dirty reads. Still vulnerable to non-repeatable reads (reading the same key twice and getting different values because another transaction committed between the two reads).
Repeatable Read / Snapshot Isolation: A transaction sees a consistent snapshot taken at transaction start. Prevents dirty reads and non-repeatable reads. Still vulnerable to write skew (two transactions read overlapping data, make disjoint writes, and produce a state that neither would have produced alone).
Serializable: Full isolation — concurrent transactions produce results equivalent to some serial ordering. Prevents all anomalies including write skew. Most expensive to enforce.
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
|---|---|---|---|---|
| Read Uncommitted | ✗ | ✗ | ✗ | ✗ |
| Read Committed | ✓ | ✗ | ✗ | ✗ |
| Repeatable Read | ✓ | ✓ | ✗/✓ | ✗ |
| Snapshot Isolation | ✓ | ✓ | ✓ | ✗ |
| Serializable | ✓ | ✓ | ✓ | ✓ |
✓ = prevented, ✗ = possible, ✗/✓ = depends on the implementation (ANSI Repeatable Read permits phantoms, but some engines prevent them)
For the OOR, snapshot isolation is the practical target: conjunction queries need a consistent view of the catalog (preventing dirty reads, non-repeatable reads, and phantoms), but full serializability's overhead is unnecessary for a read-dominated workload.
Write Skew: The Anomaly Snapshot Isolation Misses
Two conjunction analysts each read that the other is on duty. Both decide to go off-duty simultaneously, leaving no one on watch. Each transaction's writes are consistent with its own read snapshot, but the combined result violates the invariant "at least one analyst on duty."
In the OOR context: two concurrent deduplication transactions each see two TLEs for NORAD 25544, one from each of two ground stations. Each transaction keeps its own station's TLE and deletes the other's. Result: both TLEs are deleted, and the object has no TLE data. Each transaction saw a valid state, but the combined result is invalid.
Snapshot isolation does not prevent this because neither transaction writes a key that the other reads — they write disjoint keys. The conflict is at the application invariant level, not the data access level. Preventing write skew requires serializable isolation (2PL or SSI).
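The duty-roster example can be made concrete with a toy model: both transactions read the same snapshot, write disjoint keys, and the merged commit violates the invariant even though each transaction was individually consistent:

```rust
use std::collections::HashMap;

/// Toy demonstration of write skew under snapshot isolation.
/// Invariant: at least one analyst is on duty.
fn write_skew_demo() -> HashMap<&'static str, bool> {
    let snapshot: HashMap<&str, bool> =
        HashMap::from([("alice_on_duty", true), ("bob_on_duty", true)]);

    // Txn A reads "Bob is on duty" and writes only Alice's key.
    let txn_a_ok = *snapshot.get("bob_on_duty").unwrap();
    // Txn B reads "Alice is on duty" and writes only Bob's key.
    let txn_b_ok = *snapshot.get("alice_on_duty").unwrap();

    // Disjoint write sets: no write-write conflict, so both commit.
    let mut committed = snapshot.clone();
    if txn_a_ok { committed.insert("alice_on_duty", false); }
    if txn_b_ok { committed.insert("bob_on_duty", false); }
    committed // both flags are now false: the invariant is violated
}
```

Neither transaction wrote a key the other read, so first-committer-wins conflict detection never fires; only serializable isolation (2PL or SSI) would reject one of them.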
Code Examples
Transaction Interface for the OOR
/// A transaction handle that provides snapshot isolation.
struct Transaction {
/// Snapshot timestamp — all reads see data as of this moment.
read_ts: u64,
/// Commit timestamp — assigned at commit time.
write_ts: Option<u64>,
/// Buffered writes — applied to the engine only on commit.
write_set: Vec<(Vec<u8>, Option<Vec<u8>>)>,
}
impl Transaction {
fn begin(engine: &LsmEngine) -> Self {
Self {
read_ts: engine.current_timestamp(),
write_ts: None,
write_set: Vec::new(),
}
}
/// Read a key as of this transaction's snapshot timestamp.
fn get(&self, key: &[u8], engine: &LsmEngine) -> io::Result<Option<Vec<u8>>> {
// Read the version of the key that was committed at or before read_ts
engine.get_at_timestamp(key, self.read_ts)
}
/// Buffer a write (applied on commit).
fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
self.write_set.push((key, Some(value)));
}
fn delete(&mut self, key: Vec<u8>) {
self.write_set.push((key, None));
}
/// Commit the transaction: apply all buffered writes atomically.
fn commit(mut self, engine: &mut LsmEngine) -> io::Result<()> {
let write_ts = engine.next_timestamp();
self.write_ts = Some(write_ts);
// Apply all writes with the commit timestamp
for (key, value) in self.write_set {
match value {
Some(val) => engine.put_with_ts(&key, &val, write_ts)?,
None => engine.delete_with_ts(&key, write_ts)?,
}
}
Ok(())
}
}
The key insight: reads use the read_ts (taken at transaction start), so they always see a consistent snapshot. Writes are buffered and applied atomically with a write_ts (taken at commit time). Other transactions that started before write_ts will not see these writes — they read at their own read_ts.
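A toy in-memory multi-version store makes the timestamp separation concrete. This is a stand-in for the LsmEngine, not its real API; the type and method names are assumptions:

```rust
use std::collections::BTreeMap;

/// Minimal multi-version store: each commit creates a new
/// (key, commit_ts) -> value entry instead of overwriting.
struct MvStore {
    versions: BTreeMap<(String, u64), String>,
    clock: u64,
}

impl MvStore {
    fn new() -> Self { Self { versions: BTreeMap::new(), clock: 0 } }

    fn current_timestamp(&self) -> u64 { self.clock }

    /// Commit a write at a fresh timestamp; returns the commit timestamp.
    fn commit(&mut self, key: &str, value: &str) -> u64 {
        self.clock += 1;
        self.versions.insert((key.to_string(), self.clock), value.to_string());
        self.clock
    }

    /// Snapshot read: the newest version with commit_ts <= read_ts.
    fn get_at(&self, key: &str, read_ts: u64) -> Option<&str> {
        self.versions
            .range((key.to_string(), 0)..=(key.to_string(), read_ts))
            .next_back()
            .map(|(_, v)| v.as_str())
    }
}
```

A transaction that captured read_ts before a later commit keeps seeing the older version, while a transaction started after the commit sees the new one.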
Key Takeaways
- ACID properties define transaction correctness. Atomicity (all-or-nothing), Isolation (concurrent transactions don't interfere), and Durability (committed data survives crashes) are the storage engine's responsibility. Consistency is primarily the application's.
- Snapshot isolation gives each transaction a consistent view of the database as of its start time. This prevents dirty reads, non-repeatable reads, and phantom reads — sufficient for the OOR's conjunction query workload.
- Write skew is the anomaly that snapshot isolation misses. It occurs when two transactions read overlapping data and write disjoint keys, producing a combined result that neither would have produced alone.
- The transaction interface separates the read path (snapshot at read_ts) from the write path (buffered, applied at write_ts). This is the foundation for MVCC (Lesson 3).
Lesson 2 — Two-Phase Locking (2PL)
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 2 of 3
Source: Database Internals — Alex Petrov, Chapter 12; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7
Source note: This lesson was synthesized from training knowledge. Verify Petrov's 2PL description and deadlock detection algorithms against Chapter 12.
Context
Before MVCC became dominant, lock-based concurrency control was the standard approach to transaction isolation. Two-Phase Locking (2PL) is the classical protocol: transactions acquire locks before accessing data, and release them only after completing all operations. The "two phases" are the growing phase (acquiring locks) and the shrinking phase (releasing locks). A transaction never acquires a new lock after releasing any lock.
2PL provides full serializability — the strongest isolation level. But it comes at a cost: writers block readers, readers block writers, and concurrent throughput drops significantly under contention. For the OOR, where conjunction queries must not block TLE ingestion, 2PL's blocking behavior is problematic. Understanding 2PL is essential context for appreciating why MVCC (Lesson 3) is the preferred approach for read-heavy workloads.
Core Concepts
Lock Types
Shared lock (S): Allows the holder to read the locked resource. Multiple transactions can hold shared locks on the same resource simultaneously. Shared locks prevent writes but allow concurrent reads.
Exclusive lock (X): Allows the holder to read and write the locked resource. Only one transaction can hold an exclusive lock at a time. Exclusive locks block both reads and writes from other transactions.
Compatibility matrix:
| | S held | X held |
|---|---|---|
| S requested | ✓ grant | ✗ wait |
| X requested | ✗ wait | ✗ wait |
The Two Phases
Growing phase: The transaction acquires locks as needed (shared for reads, exclusive for writes). It never releases any lock during this phase.
Shrinking phase: After the transaction decides to commit (or abort), it releases all locks. Once any lock is released, no new locks can be acquired.
Strict 2PL is the common variant: all locks are held until the transaction commits or aborts. No locks are released during the shrinking phase — they are all released at once at commit time. This prevents cascading aborts (where one transaction's abort forces other transactions that read its uncommitted data to also abort).
Deadlocks
Two transactions can deadlock if each holds a lock the other needs:
- Transaction A holds an exclusive lock on NORAD 25544 and requests a shared lock on NORAD 43013.
- Transaction B holds an exclusive lock on NORAD 43013 and requests a shared lock on NORAD 25544.
- Neither can proceed. Both are stuck waiting.
Detection strategies:
- Timeout: If a lock wait exceeds a threshold, abort the transaction and retry. Simple but imprecise — the timeout may be too long (wasted time) or too short (false positives).
- Wait-for graph: Maintain a directed graph of which transactions are waiting for which. A cycle in the graph indicates a deadlock. Abort one transaction in the cycle (typically the youngest or the one with the least work done).
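Cycle detection over the wait-for graph can be sketched as a depth-first search. The graph representation (transaction ID → IDs it waits on) is an assumption for illustration:

```rust
use std::collections::HashMap;

/// Returns true if the wait-for graph contains a cycle, i.e. a deadlock.
fn has_deadlock(wait_for: &HashMap<u64, Vec<u64>>) -> bool {
    fn visit(
        node: u64,
        wait_for: &HashMap<u64, Vec<u64>>,
        path: &mut Vec<u64>, // transactions on the current DFS path
        done: &mut Vec<u64>, // transactions already proven cycle-free
    ) -> bool {
        if path.contains(&node) { return true; }  // back edge: cycle found
        if done.contains(&node) { return false; }
        path.push(node);
        for &next in wait_for.get(&node).map(|v| v.as_slice()).unwrap_or(&[]) {
            if visit(next, wait_for, path, done) { return true; }
        }
        path.pop();
        done.push(node);
        false
    }
    let mut done = Vec::new();
    wait_for.keys().any(|&n| visit(n, wait_for, &mut Vec::new(), &mut done))
}
```

Once a cycle is found, the lock manager aborts one transaction on it (typically the youngest), which breaks the cycle and unblocks the rest.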
2PL Performance Characteristics
Under low contention (few transactions accessing the same keys), 2PL performs well — most lock requests are granted immediately. Under high contention (many transactions accessing overlapping keys), performance degrades:
- Writers block readers: a bulk TLE update holding exclusive locks on 500 objects blocks all conjunction queries that need any of those objects.
- Lock overhead: acquiring and releasing locks, checking the wait-for graph, and managing the lock table all consume CPU.
- Deadlock aborts: wasted work when a deadlock victim is rolled back and retried.
For the OOR's workload — frequent long-running read transactions (conjunction queries) alongside burst write transactions (TLE ingestion) — 2PL would cause conjunction queries to stall during every ingestion burst.
Code Examples
A Simple Lock Manager
use std::collections::HashMap;
use std::sync::{Mutex, Condvar};
#[derive(Debug, Clone, Copy, PartialEq)]
enum LockMode { Shared, Exclusive }
struct LockEntry {
mode: LockMode,
holders: Vec<u64>, // Transaction IDs holding this lock
wait_queue: Vec<(u64, LockMode)>, // Transactions waiting for this lock
}
struct LockManager {
locks: Mutex<HashMap<Vec<u8>, LockEntry>>,
cond: Condvar,
}
impl LockManager {
    fn acquire(&self, txn_id: u64, key: &[u8], mode: LockMode) -> bool {
        let mut locks = self.locks.lock().unwrap();
        loop {
            let entry = locks.entry(key.to_vec()).or_insert_with(|| LockEntry {
                mode: LockMode::Shared,
                holders: Vec::new(),
                wait_queue: Vec::new(),
            });
            let can_grant = match (mode, entry.holders.is_empty()) {
                (_, true) => true, // No holders — any mode is fine
                (LockMode::Shared, false) => entry.mode == LockMode::Shared,
                (LockMode::Exclusive, false) => false,
            };
            if can_grant {
                // Dequeue ourselves if a previous iteration queued us.
                entry.wait_queue.retain(|&(id, _)| id != txn_id);
                entry.mode = mode;
                entry.holders.push(txn_id);
                return true;
            }
            // Cannot grant — queue at most once, then block until notified.
            if !entry.wait_queue.iter().any(|&(id, _)| id == txn_id) {
                entry.wait_queue.push((txn_id, mode));
            }
            // Simplified: a real implementation also checks for deadlock here.
            locks = self.cond.wait(locks).unwrap();
        }
    }
fn release(&self, txn_id: u64, key: &[u8]) {
let mut locks = self.locks.lock().unwrap();
if let Some(entry) = locks.get_mut(key) {
entry.holders.retain(|&id| id != txn_id);
if entry.holders.is_empty() {
// Grant to the first waiter
if let Some((waiter_id, waiter_mode)) = entry.wait_queue.first().copied() {
entry.holders.push(waiter_id);
entry.mode = waiter_mode;
entry.wait_queue.remove(0);
}
}
self.cond.notify_all();
}
}
}
This simplified lock manager illustrates the core mechanics. Production lock managers use per-key condition variables (not a single global one), hash-based lock tables for O(1) lookup, and wait-for graph tracking for deadlock detection.
Key Takeaways
- Two-phase locking provides serializable isolation by ensuring transactions acquire all locks before releasing any. Strict 2PL holds all locks until commit.
- 2PL's blocking behavior is the fundamental problem: writers block readers and readers block writers. For read-heavy workloads like conjunction queries, this creates unacceptable stalls during concurrent writes.
- Deadlocks are an inherent risk of lock-based concurrency. Detection via wait-for graphs and resolution via transaction abort are standard but add overhead and wasted work.
- 2PL is still used in some systems (MySQL/InnoDB for certain isolation levels, distributed databases for coordination). Understanding it provides essential context for why MVCC is preferred.
Lesson 3 — MVCC and Snapshot Isolation
Module: Database Internals — M05: Transactions & Isolation
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 13; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 7; Mini-LSM Week 3
Source note: This lesson was synthesized from training knowledge and the Mini-LSM Week 3 MVCC chapters. Verify Petrov's MVCC version chain description and Kleppmann's snapshot isolation anomaly analysis.
Context
MVCC solves the fundamental problem of 2PL — writers blocking readers — by keeping multiple versions of each key. Instead of locking a key and making other transactions wait, the engine stores every version alongside a timestamp. Readers pick the version that matches their snapshot timestamp; writers create new versions without disturbing old ones. Readers never block writers. Writers never block readers. The only conflict is writer-writer on the same key.
For the LSM architecture, MVCC is a natural fit. The LSM already stores sorted key-value pairs — extending keys to include a timestamp is a straightforward encoding change. Mini-LSM's Week 3 implements exactly this: the key format changes from user_key to (user_key, timestamp), where newer timestamps sort first. A snapshot read at timestamp T scans for the first version of each key with timestamp ≤ T.
Core Concepts
Timestamped Keys in LSM
The MVCC key format encodes the user key and a commit timestamp into a single sortable byte string:
MVCC key = [user_key_bytes] [timestamp as big-endian u64, inverted]
The timestamp is stored as big-endian and bitwise inverted (XOR with u64::MAX) so that newer timestamps sort before older ones in the LSM's byte-order comparison. This means a scan for key "NORAD-25544" encounters the newest version first — exactly what a snapshot read needs.
Example ordering for key "NORAD-25544":
"NORAD-25544" | ts=110 (inverted: 0xFFFFFFFFFFFFFF91) ← newest, sorts first
"NORAD-25544" | ts=80 (inverted: 0xFFFFFFFFFFFFFFAF)
"NORAD-25544" | ts=50 (inverted: 0xFFFFFFFFFFFFFFCD) ← oldest, sorts last
Snapshot Read
A transaction with read_ts = 100 reading key K:
- Seek to the first MVCC key with prefix K.
- Scan versions from newest to oldest.
- Return the first version with commit_ts ≤ read_ts.
- If the first matching version is a tombstone, the key is deleted in this snapshot — return None.
This is O(V) where V is the number of versions of the key. In practice V is small (1–5 for most keys) because compaction garbage-collects old versions.
Write Path with Timestamps
When a transaction commits with write_ts = 105:
- For each key in the write set, create an MVCC key (user_key, 105).
- Write all MVCC key-value pairs to the memtable (via the WAL, as in Module 4).
- These versions become visible to any transaction with read_ts ≥ 105.
Write-write conflicts: if two transactions both write the same user key, the later commit must detect the conflict. Simple approach: check if any version of the key with commit_ts > txn.read_ts exists at commit time. If so, another transaction has written this key after our snapshot — abort and retry.
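As a minimal illustration of this first-committer-wins check, the sketch below uses a toy in-memory version map in place of the real memtable/SSTable scan; `VersionStore` and `has_conflict` are illustrative names, not part of the engine's API:

```rust
use std::collections::BTreeMap;

/// Toy version store: user key -> commit timestamps of existing versions.
/// (In the real engine this check scans the memtable and SSTables.)
struct VersionStore {
    versions: BTreeMap<Vec<u8>, Vec<u64>>,
}

impl VersionStore {
    /// First-committer-wins check: if any version of `key` was committed
    /// after this transaction's snapshot, the write conflicts.
    fn has_conflict(&self, key: &[u8], read_ts: u64) -> bool {
        self.versions
            .get(key)
            .map_or(false, |commits| commits.iter().any(|&ts| ts > read_ts))
    }
}

fn main() {
    let mut store = VersionStore { versions: BTreeMap::new() };
    store.versions.insert(b"NORAD-25544".to_vec(), vec![50, 80]);

    // Txn A took its snapshot at read_ts = 90: no version newer than 90 exists.
    assert!(!store.has_conflict(b"NORAD-25544", 90));

    // Txn B commits a new version at ts = 105.
    store.versions.get_mut(b"NORAD-25544".as_slice()).unwrap().push(105);

    // Txn A (read_ts = 90) must now abort and retry.
    assert!(store.has_conflict(b"NORAD-25544", 90));
    println!("conflict detected");
}
```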
Watermark and Garbage Collection
Old versions accumulate. If every write creates a new version, the database grows without bound. Garbage collection removes versions that are no longer visible to any active transaction.
The watermark is the minimum read_ts among all active transactions. Any version with commit_ts < watermark that has a newer version is safe to garbage-collect — no active transaction can ever read it.
Active transactions: read_ts = [100, 150, 200]
Watermark = 100
Key "NORAD-25544" versions:
ts=200 (value_v3) ← keep (above watermark, no newer version)
ts=110 (value_v2) ← keep (above watermark, may be needed by ts=100..149 txns)
ts=80 (value_v1) ← keep (ts=100 txn might need it — 80 ≤ 100 and next version is 110)
ts=30 (value_v0) ← GARBAGE: ts=80 exists and 30 < watermark, so no txn will ever read v0
Why is v0 (ts=30) safe to collect? A transaction reading at ts=30..79 would still need it, but the watermark of 100 guarantees no active transaction has read_ts < 100. Every possible reader resolves to ts=80 or newer, so v0 can never be read again.
Garbage collection happens during compaction: when the merge iterator encounters multiple versions of the same key, it keeps every version above the watermark, plus the newest version at or below the watermark (needed by transactions reading at exactly the watermark timestamp). All older versions are dropped.
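The compaction-time decision can be sketched as a pure function over one key's version timestamps (a sketch for clarity; real GC runs inline inside the merge iterator):

```rust
/// Given all versions of one user key (commit timestamps sorted newest-first),
/// return the timestamps that survive watermark-aware garbage collection:
/// every version above the watermark, plus the newest version at or below it.
fn surviving_versions(version_ts_newest_first: &[u64], watermark: u64) -> Vec<u64> {
    let mut keep = Vec::new();
    for &ts in version_ts_newest_first {
        if ts > watermark {
            keep.push(ts);
        } else {
            // Newest version at or below the watermark: keep it and stop.
            // No active transaction can ever read anything older.
            keep.push(ts);
            break;
        }
    }
    keep
}

fn main() {
    // Versions from the example above, watermark = 100.
    let kept = surviving_versions(&[200, 110, 80, 30], 100);
    assert_eq!(kept, vec![200, 110, 80]); // ts=30 is garbage-collected
    println!("{:?}", kept);
}
```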
Write Batch Atomicity
MVCC writes must be atomic — all keys in a transaction get the same write_ts, and they all become visible at once. In the LSM engine, this means all keys in a write batch are logged to the WAL as a single unit and inserted into the memtable together. The write_ts is assigned from a global monotonic counter protected by a mutex (as in Mini-LSM's approach).
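A minimal sketch of an atomic batch commit, assuming a mutex-guarded counter and a single memtable map (WAL logging omitted; `Engine` and `commit_batch` are illustrative names):

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

fn encode_mvcc_key(user_key: &[u8], ts: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&(!ts).to_be_bytes());
    k
}

/// All keys in a batch get one write_ts and are inserted under a single
/// lock, so they become visible atomically. (The real engine also appends
/// the batch to the WAL before touching the memtable.)
struct Engine {
    commit_ts: Mutex<u64>,
    memtable: Mutex<BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl Engine {
    fn commit_batch(&self, writes: &[(&[u8], &[u8])]) -> u64 {
        // Holding the counter lock for the whole insert serializes commits,
        // mirroring Mini-LSM's single-writer commit path.
        let mut ts = self.commit_ts.lock().unwrap();
        *ts += 1;
        let write_ts = *ts;
        let mut mem = self.memtable.lock().unwrap();
        for (key, value) in writes {
            mem.insert(encode_mvcc_key(key, write_ts), value.to_vec());
        }
        write_ts
    }
}

fn main() {
    let engine = Engine { commit_ts: Mutex::new(100), memtable: Mutex::new(BTreeMap::new()) };
    let ts = engine.commit_batch(&[
        (&b"NORAD-25544"[..], &b"tle_v1"[..]),
        (&b"NORAD-43013"[..], &b"tle_v1"[..]),
    ]);
    assert_eq!(ts, 101);
    assert_eq!(engine.memtable.lock().unwrap().len(), 2);
    println!("batch committed at ts={ts}");
}
```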
Code Examples
MVCC Key Encoding
/// Encode a user key and timestamp into an MVCC key.
/// Timestamps are inverted so newer versions sort first.
fn encode_mvcc_key(user_key: &[u8], timestamp: u64) -> Vec<u8> {
let mut key = Vec::with_capacity(user_key.len() + 8);
key.extend_from_slice(user_key);
// Invert timestamp: newer (larger) timestamps become smaller bytes,
// sorting first in ascending byte order.
key.extend_from_slice(&(!timestamp).to_be_bytes());
key
}
/// Decode an MVCC key back into user key and timestamp.
fn decode_mvcc_key(mvcc_key: &[u8]) -> (&[u8], u64) {
let ts_start = mvcc_key.len() - 8;
let user_key = &mvcc_key[..ts_start];
let inverted_ts = u64::from_be_bytes(
mvcc_key[ts_start..].try_into().unwrap()
);
(user_key, !inverted_ts)
}
Snapshot Read with MVCC
impl LsmEngine {
/// Read the value of a key at the given snapshot timestamp.
fn get_at_timestamp(
&self,
user_key: &[u8],
read_ts: u64,
) -> io::Result<Option<Vec<u8>>> {
// Seek to the newest version of this key
let seek_key = encode_mvcc_key(user_key, u64::MAX);
// Create a merge iterator over memtable + SSTables
let mut iter = self.create_merge_iterator(&seek_key)?;
while let Some((mvcc_key, value)) = iter.next()? {
let (key, ts) = decode_mvcc_key(&mvcc_key);
// Stop if we've moved past this user key
if key != user_key {
return Ok(None);
}
// Skip versions newer than our snapshot
if ts > read_ts {
continue;
}
// This is the newest version visible to us
return match value {
Some(val) => Ok(Some(val)),
None => Ok(None), // Tombstone — key is deleted in this snapshot
};
}
Ok(None) // Key not found in any source
}
}
The seek to (user_key, u64::MAX) positions the iterator at the newest possible version of the key (since u64::MAX is the largest timestamp, and inverted it becomes the smallest byte value). The iterator then scans forward in byte order, from newest version to oldest, until it finds one with ts ≤ read_ts.
Key Takeaways
- MVCC stores multiple versions of each key with commit timestamps. Readers select the version matching their snapshot timestamp. Writers create new versions without disturbing old ones.
- In the LSM architecture, MVCC keys are encoded as (user_key, inverted_timestamp) so that newer versions sort first in byte order. This makes snapshot reads efficient — the first matching version is the correct one.
- Readers never block writers, and writers never block readers. The only conflict is write-write on the same key, detected at commit time.
- Garbage collection removes old versions that are below the watermark (minimum active read_ts). This happens during compaction and is essential for bounding space amplification.
- Write skew remains possible under snapshot isolation. For the OOR, this is an acceptable tradeoff — conjunction queries need consistent snapshots, not full serializability.
Project — Conjunction Query Engine with MVCC Snapshot Reads
Module: Database Internals — M05: Transactions & Isolation
Track: Orbital Object Registry
Estimated effort: 8–10 hours
SDA Incident Report — OOR-2026-0046
Classification: ENGINEERING DIRECTIVE
Subject: Add MVCC snapshot isolation to the OOR storage engine
Ref: OOR-2026-0046 (stale TLE data in conjunction assessment)
Extend the LSM engine with MVCC support. Conjunction queries must see consistent catalog snapshots. Concurrent TLE updates must not block or corrupt reads.
Acceptance Criteria
- MVCC key encoding. Encode user keys with inverted big-endian timestamps. Verify that newer versions sort before older versions in byte order.
- Snapshot read correctness. Insert key "NORAD-25544" at timestamps 50, 80, and 110. Read at timestamps 60, 90, and 120. Verify each read returns the correct version (ts=50, ts=80, ts=110 respectively).
- Tombstone visibility. Insert key "NORAD-99999" at ts=50, delete at ts=80. Read at ts=60 → value. Read at ts=90 → None.
- Concurrent reads and writes. Spawn two threads: one performs 10,000 reads at a fixed snapshot, the other performs 1,000 writes with incrementing timestamps. Verify all reads return consistent results (no torn reads, no version mixing). Writers must not block readers.
- Write conflict detection. Start two transactions with overlapping read timestamps. Both write the same key. The first to commit succeeds; the second detects the conflict and is aborted.
- Garbage collection. Set watermark to 100. Insert versions at ts=30, 70, 90, 120 for a key. Run compaction. Verify that ts=30 is garbage-collected, ts=70 and ts=90 are retained (safety margin), and ts=120 is retained.
- Conjunction simulation. Load 10,000 TLE records. Start a conjunction query (snapshot read over 100 objects). While the query is running, update 50 of those objects. Verify the query sees only the pre-update versions.
Starter Structure
conjunction-engine/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point
│ ├── mvcc.rs # MVCC key encoding, Transaction, conflict detection
│ ├── lsm.rs # Extended with timestamp-aware get/put/scan
│ ├── compaction.rs # Extended with watermark-aware GC
│ └── (reuse remaining modules from Modules 3–4)
Hints
Hint 1 — Timestamp encoding
Use !timestamp (bitwise NOT) converted to big-endian bytes, appended to the user key. This makes newer timestamps sort first without modifying the LSM's comparator.
Hint 2 — Write conflict detection
At commit time, scan the memtable and SSTables for any version of the key with commit_ts > txn.read_ts. If found, another transaction wrote this key after our snapshot — abort.
Hint 3 — Watermark computation
Maintain a counted ordered map (e.g., BTreeMap<u64, u32>) of all active transaction read timestamps — a plain BTreeSet<u64> loses information when two transactions share a read_ts. The watermark is the minimum key in the map. When a transaction commits or aborts, decrement the count for its read_ts and remove the entry when it reaches zero. Use a mutex to protect the map.
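One possible shape for such a tracker, using a counted map so that several transactions sharing the same read_ts are handled correctly (a sketch under those assumptions, not the required design; the mutex wrapper is omitted):

```rust
use std::collections::BTreeMap;

/// Tracks active transaction read timestamps and exposes the watermark.
struct WatermarkTracker {
    active: BTreeMap<u64, u32>, // read_ts -> number of active txns at that ts
}

impl WatermarkTracker {
    fn begin(&mut self, read_ts: u64) {
        *self.active.entry(read_ts).or_insert(0) += 1;
    }

    fn end(&mut self, read_ts: u64) {
        if let Some(count) = self.active.get_mut(&read_ts) {
            *count -= 1;
            if *count == 0 {
                self.active.remove(&read_ts);
            }
        }
    }

    /// Minimum active read_ts; None when no transaction is active.
    fn watermark(&self) -> Option<u64> {
        self.active.keys().next().copied()
    }
}

fn main() {
    let mut w = WatermarkTracker { active: BTreeMap::new() };
    w.begin(100);
    w.begin(100);
    w.begin(150);
    assert_eq!(w.watermark(), Some(100));
    w.end(100);
    assert_eq!(w.watermark(), Some(100)); // a second txn at ts=100 is still active
    w.end(100);
    assert_eq!(w.watermark(), Some(150));
}
```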
What Comes Next
Module 6 (Query Processing) builds structured query execution on top of the MVCC storage engine — scan operators, join algorithms, and the volcano iterator model for composable query plans.
Module 06 — Query Processing
Track: Database Internals — Orbital Object Registry
Position: Module 6 of 6
Source material: Database Internals — Alex Petrov, Chapters 14–15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Quiz pass threshold: 70% on all three lessons to unlock the project
Mission Context
SDA INCIDENT REPORT — OOR-2026-0047
Classification: PERFORMANCE DEFICIENCY
Subject: Multi-source TLE merge exceeds conjunction window deadline
The OOR ingests TLE data from 5 independent sources (18th SDS, ESA SST, LeoLabs, ExoAnalytic, Numerica). When multiple sources provide TLEs for the same object, the engine must merge them — selecting the most recent epoch, resolving conflicts, and joining against the master catalog. Currently this is done in application code with ad-hoc nested loops. A full catalog merge of 100,000 objects from 5 sources takes 45 seconds. The conjunction pipeline requires merge results within 10 seconds.
Directive: Implement a structured query processing layer: scan operators, join algorithms, and a composable execution model that can be optimized for the catalog merge workload.
Learning Outcomes
After completing this module, you will be able to:
- Implement the volcano (iterator) model for pull-based query execution with composable operators
- Explain vectorized execution and why processing column batches outperforms row-at-a-time for analytical queries
- Implement nested-loop, hash, and sort-merge join algorithms and determine which is optimal for a given workload
- Compose scan, filter, projection, and join operators into a query execution plan
- Analyze the I/O and memory costs of different join strategies for the OOR catalog merge workload
Lesson Summary
Lesson 1 — The Volcano (Iterator) Model
Pull-based query execution. Each operator (scan, filter, join) implements next() → Option<Row>. Operators compose like iterator chains. Pipelining and its limitations.
Lesson 2 — Vectorized Execution
Processing batches of rows (column vectors) instead of one row at a time. Cache efficiency, SIMD potential, and why OLAP engines (DuckDB, Velox, DataFusion) use vectorized execution.
Lesson 3 — Join Algorithms
Nested-loop join, hash join, and sort-merge join. Cost models, memory requirements, and when each algorithm is optimal. Application to the OOR multi-source TLE merge.
Capstone Project — Orbital Catalog Merge System
Build a query execution engine that merges TLE data from 5 sources using composable operators. The merge pipeline uses scan, filter, sort, and join operators composed in the volcano model. Full brief in project-catalog-merge.md.
File Index
module-06-query-processing/
├── README.md
├── lesson-01-volcano-model.md
├── lesson-01-quiz.toml
├── lesson-02-vectorized-execution.md
├── lesson-02-quiz.toml
├── lesson-03-join-algorithms.md
├── lesson-03-quiz.toml
└── project-catalog-merge.md
Prerequisites
- Module 5 (Transactions & Isolation) completed
- All previous modules in the Database Internals track
Track Complete
This is the final module of the Database Internals track. After completing it, you will have built a storage engine from the ground up: page layout (M1) → B-tree indexing (M2) → LSM write-optimized storage (M3) → crash recovery (M4) → MVCC concurrency (M5) → query processing (M6).
[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A query plan has the structure: Project → Filter → Sort → SeqScan. The SeqScan produces
100,000 TLE rows. How many rows does the Sort operator need to buffer before it can emit its
first output row?
"""
prompt.distractors = [
"0 — Sort is a pipelining operator that emits rows as they arrive.",
"1 — Sort only needs to buffer the current minimum row.",
"50,000 — Sort buffers half the input and emits the sorted half."
]
answer.answer = "All 100,000. Sort is a pipeline breaker — it must consume every input row before it can determine which row is first in the sorted order. Only after reading all 100,000 rows can it begin emitting output."
context = """
Sort is the canonical pipeline breaker. It cannot emit the smallest row until it has seen all
rows — any subsequent input row could be smaller than the current minimum. The entire dataset
must be materialized in memory (or spilled to disk for large datasets). This means the Sort
operator's memory consumption is O(N), and the operators above it (Filter, Project) see no
output until all 100,000 rows have been consumed. Pipeline breakers are the main source of
latency and memory pressure in query execution plans.
"""
[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A Filter operator with selectivity 1% sits above a SeqScan over 100,000 rows. In the volcano
model, how many times does the Filter's next() call the SeqScan's next()?
"""
prompt.distractors = [
"1,000 — once per matching row.",
"1 — the Filter batches its requests to the SeqScan.",
"It depends on whether the query has a LIMIT clause."
]
answer.answer = "100,000 — the Filter calls SeqScan.next() for every row and discards the 99% that don't match. The Filter examines all rows to find the 1% that pass the predicate."
context = """
In the volcano model, the Filter operator has no way to skip rows — it must pull every row from
its child and evaluate the predicate. Even with 1% selectivity, all 100,000 rows are produced by
the SeqScan and examined by the Filter. This is one of the model's inefficiencies: a filter
cannot 'push down' a predicate into the scan to skip irrelevant rows (though index scans achieve
this by using the B+ tree to jump directly to matching keys). A LIMIT clause on the root operator
would allow early termination once enough matching rows are found, but the Filter itself doesn't
know about LIMIT — it just responds to next() calls.
"""
[[questions]]
type = "Tracing"
prompt.program = """
fn main() {
// Simulate volcano model execution
let data = vec![10, 25, 30, 45, 50];
let mut cursor = 0;
let mut output = Vec::new();
// Filter: value > 20, then Project: value * 2
loop {
// SeqScan.next()
if cursor >= data.len() { break; }
let row = data[cursor];
cursor += 1;
// Filter
if row <= 20 { continue; }
// Project
let projected = row * 2;
output.push(projected);
}
println!("{:?}", output);
println!("{:?}", output);
}
"""
answer.doesCompile = true
answer.stdout = "[50, 60, 90, 100]"
context = """
SeqScan produces: 10, 25, 30, 45, 50. Filter (> 20) passes: 25, 30, 45, 50 (rejects 10).
Project (* 2) transforms: 50, 60, 90, 100. The volcano model processes one row at a time
through the pipeline. Row 10 is scanned, fails the filter, and is discarded. Row 25 is
scanned, passes the filter, and is projected to 50. And so on. The output is [50, 60, 90, 100].
"""
[[questions]]
type = "MultipleChoice"
prompt.prompt = """
The OOR query engine uses Box<dyn Operator> for composable operator trees. An engineer
profiles a CPU-bound catalog merge and finds 40% of CPU time is spent in virtual dispatch
overhead from next() calls. What is the most effective optimization?
"""
prompt.distractors = [
"Replace trait objects with enum dispatch — this eliminates virtual dispatch.",
"Use unsafe to skip the dynamic dispatch entirely.",
"Add a prefetch() method to the Operator trait to warm the CPU cache."
]
answer.answer = "Switch to vectorized execution (Lesson 2) — process batches of rows per next() call instead of single rows. This amortizes the dispatch overhead across hundreds of rows per call."
context = """
Enum dispatch eliminates vtable indirection but doesn't address the fundamental problem: calling
next() 100,000 times with one row each. Vectorized execution calls next() ~200 times with 500
rows each — the same total rows but 500x fewer function calls. The per-call overhead (virtual
dispatch, function entry/exit, branch prediction) is amortized across the batch. This is why
DuckDB, DataFusion, and Velox all use vectorized execution for analytical queries. Unsafe code
is never the answer to algorithmic overhead.
"""
[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A query plan with 5 chained operators (Scan → Filter → Filter → Project → Project) processes
100,000 rows. Under the volcano model, approximately how many function calls are made in total
for next() across all operators?
"""
prompt.distractors = [
"100,000 — one call at the root pulls the entire chain.",
"5 — one next() call per operator.",
"200,000 — the Scan produces 100,000 and the first Filter doubles them."
]
answer.answer = "Up to 500,000 — each of the 5 operators calls next() up to 100,000 times (once per input row). The root calls the second Project 100,000 times, which calls Filter2, which calls Filter1, which calls Scan — 5 × 100,000."
context = """
In the worst case (all rows pass all filters), every operator's next() is called once per input
row. The root operator calls its child 100,000 times, which calls its child 100,000 times, and
so on. Total: 5 × 100,000 = 500,000 next() calls. If the filters are selective (e.g., first
filter passes 10%), the upper operators are called fewer times (10,000 for everything above the
first filter), reducing total calls. But the scan and first filter always process all 100,000
rows. This per-row overhead is why vectorized execution (processing batches) is critical for
CPU-bound workloads.
"""
Lesson 2 — Vectorized Execution
Module: Database Internals — M06: Query Processing
Position: Lesson 2 of 3
Source: Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Source note: This lesson was synthesized from training knowledge. Verify Kleppmann's columnar processing analysis against Chapter 6. Additional references: MonetDB/X100 paper (Boncz et al., 2005), DuckDB architecture documentation.
Context
The volcano model processes one row at a time. For the OOR's catalog merge — 500,000 rows from 5 sources, with comparison logic on 4 floating-point fields — the per-row function call overhead and cache inefficiency dominate CPU time. Vectorized execution addresses this by changing the unit of work from a single row to a batch (or vector) of rows.
Instead of next() → Option<Row>, vectorized operators return next_batch() → Option<ColumnBatch>, where a ColumnBatch contains 256–2048 rows stored in columnar format (one array per column). Operators process entire columns at once — a filter evaluates a predicate on an f64 array of 1024 inclination values in a tight loop, producing a selection bitmap. This tight loop is cache-friendly (one contiguous array), branch-predictor-friendly (same instruction repeated), and SIMD-exploitable (process 4 or 8 values per instruction).
Core Concepts
Row-at-a-Time vs. Batch-at-a-Time
Row-at-a-time (volcano): Each operator call processes 1 row. For N rows through K operators: N × K function calls, each touching a different memory region.
Batch-at-a-time (vectorized): Each operator call processes B rows. For N rows through K operators: (N/B) × K function calls. Each call processes a contiguous array, keeping the CPU cache warm and enabling auto-vectorization (SIMD).
The batch size B is typically 1024 or 2048 — large enough to amortize per-call overhead, small enough to fit in L1/L2 cache.
Columnar Batch Format
A vectorized batch stores data in columnar layout — one array per column:
Row-oriented (volcano):
Row 0: { norad_id: 25544, epoch: 84.7, inclination: 51.6, ... }
Row 1: { norad_id: 43013, epoch: 84.2, inclination: 97.4, ... }
Columnar (vectorized):
norad_id: [25544, 43013, ...] ← contiguous u32 array
epoch: [84.7, 84.2, ...] ← contiguous f64 array
inclination: [51.6, 97.4, ...] ← contiguous f64 array
A filter on inclination > 80.0 processes the inclination array without touching norad_id or epoch — only the relevant column is loaded into cache. The Rust compiler can auto-vectorize the tight comparison loop into SIMD instructions (e.g., _mm256_cmp_pd comparing 4 f64 values per instruction).
Selection Vectors
Instead of copying matching rows to a new batch (expensive), vectorized engines use a selection vector — an array of indices into the batch that identifies which rows passed the filter:
// Filter: inclination > 80.0
let inclinations: &[f64] = &batch.inclination;
let mut selection: Vec<u32> = Vec::new();
for (i, &inc) in inclinations.iter().enumerate() {
if inc > 80.0 {
selection.push(i as u32);
}
}
// selection = [1, 3, 7, ...] ← indices of matching rows
Downstream operators use the selection vector to skip non-matching rows without copying data. This avoids the memory allocation and copy cost of materializing filtered batches.
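For example, an aggregation downstream of the filter can consume the original column through the selection vector without any copying (the column names follow the TLE batch above; `sum_selected` is an illustrative helper, not an engine API):

```rust
/// Sum a column over only the rows named by the selection vector,
/// without materializing a filtered copy of the batch.
fn sum_selected(values: &[f64], selection: &[u32]) -> f64 {
    selection.iter().map(|&i| values[i as usize]).sum()
}

fn main() {
    let inclinations = [51.6, 97.4, 53.0, 86.4];
    let mean_motions = [15.5, 14.9, 15.1, 14.3];

    // Filter: inclination > 80.0 → selection vector of matching row indices.
    let selection: Vec<u32> = inclinations
        .iter()
        .enumerate()
        .filter(|&(_, &inc)| inc > 80.0)
        .map(|(i, _)| i as u32)
        .collect();
    assert_eq!(selection, vec![1, 3]);

    // Downstream operator reads the original column via the selection vector.
    let total = sum_selected(&mean_motions, &selection);
    assert!((total - 29.2).abs() < 1e-9);
    println!("selected sum = {total}");
}
```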
Apache Arrow as the Columnar Format
Apache Arrow defines a standardized in-memory columnar format used by DataFusion, DuckDB (internally similar), Polars, and many data processing engines. Key features:
- Zero-copy sharing between operators — no serialization/deserialization between pipeline stages.
- Validity bitmaps for null handling — one bit per value indicating null/non-null.
- Dictionary encoding for low-cardinality string columns — stores unique values once and references them by index.
For the OOR, using Arrow arrays for TLE column batches enables integration with the broader Rust data ecosystem (the arrow crate).
Code Examples
Vectorized Filter Operator
const BATCH_SIZE: usize = 1024;
/// A batch of TLE records in columnar format.
struct TleBatch {
norad_ids: Vec<u32>,
epochs: Vec<f64>,
inclinations: Vec<f64>,
mean_motions: Vec<f64>,
len: usize,
}
/// Vectorized operator interface.
trait VecOperator {
fn open(&mut self) -> io::Result<()>;
fn next_batch(&mut self) -> io::Result<Option<TleBatch>>;
fn close(&mut self) -> io::Result<()>;
}
/// Vectorized filter: evaluates predicate on entire columns at once.
struct VecFilter {
child: Box<dyn VecOperator>,
/// Returns a boolean mask: true for rows that pass the filter.
predicate: Box<dyn Fn(&TleBatch) -> Vec<bool>>,
}
impl VecOperator for VecFilter {
fn open(&mut self) -> io::Result<()> { self.child.open() }
fn next_batch(&mut self) -> io::Result<Option<TleBatch>> {
loop {
match self.child.next_batch()? {
None => return Ok(None),
Some(batch) => {
let mask = (self.predicate)(&batch);
let filtered = apply_mask(&batch, &mask);
if filtered.len > 0 {
return Ok(Some(filtered));
}
// Entire batch filtered out — pull next
}
}
}
}
fn close(&mut self) -> io::Result<()> { self.child.close() }
}
/// Apply a boolean mask to a batch, keeping only rows where mask[i] is true.
fn apply_mask(batch: &TleBatch, mask: &[bool]) -> TleBatch {
let mut out = TleBatch {
norad_ids: Vec::new(), epochs: Vec::new(),
inclinations: Vec::new(), mean_motions: Vec::new(), len: 0,
};
for (i, &keep) in mask.iter().enumerate() {
if keep {
out.norad_ids.push(batch.norad_ids[i]);
out.epochs.push(batch.epochs[i]);
out.inclinations.push(batch.inclinations[i]);
out.mean_motions.push(batch.mean_motions[i]);
out.len += 1;
}
}
out
}
The predicate function operates on entire columns: |batch| batch.inclinations.iter().map(|&inc| inc > 80.0).collect(). This tight loop over a contiguous f64 array is exactly the pattern the compiler auto-vectorizes into SIMD instructions. A production implementation would use selection vectors instead of copying rows.
Key Takeaways
- Vectorized execution processes batches of rows (typically 1024) instead of single rows, reducing per-row overhead by 100–1000x for CPU-bound operations.
- Columnar layout stores each column as a contiguous array, enabling cache-efficient processing and SIMD auto-vectorization. A filter on one column never touches other columns.
- Selection vectors track which rows pass a filter without copying data. This avoids materialization cost and keeps downstream operators working on the original arrays.
- Apache Arrow provides a standardized columnar format for zero-copy interop between operators and libraries. The arrow Rust crate is the foundation for DataFusion.
- Vectorized execution is most impactful for CPU-bound analytical queries (aggregations, joins, comparisons). For I/O-bound point lookups, the volcano model is sufficient.
Lesson 3 — Join Algorithms
Module: Database Internals — M06: Query Processing
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6
Source note: This lesson was synthesized from training knowledge. Verify Petrov's join algorithm cost models and Kleppmann's distributed join discussion against the source chapters.
Context
The OOR's catalog merge problem is fundamentally a join: match TLE records from 5 sources on NORAD catalog ID, then select the best TLE for each object (most recent epoch, highest source priority). In SQL terms: SELECT * FROM source_a JOIN source_b ON a.norad_id = b.norad_id.
The choice of join algorithm determines whether this merge takes 45 seconds (nested-loop) or under 1 second (hash join). This lesson covers the three fundamental join algorithms, their cost models, and when each is optimal.
Core Concepts
Nested-Loop Join
The simplest join: for each row in the outer table, scan the entire inner table for matches.
for each row_a in source_a: # |A| iterations
for each row_b in source_b: # |B| iterations per outer row
if row_a.norad_id == row_b.norad_id:
emit (row_a, row_b)
Cost: O(|A| × |B|) comparisons. For two 100k-row sources: 10 billion comparisons. Completely impractical for the OOR catalog merge.
When to use: Only when one side is very small (< 100 rows) or when no better algorithm is available (no index, insufficient memory for a hash table). Also useful for non-equi joins (e.g., a.epoch > b.epoch) where hash join doesn't apply.
Block nested-loop improves this by loading a block of B outer rows into memory and scanning the inner table once per block rather than once per row, cutting the number of inner-table scans from |A| to |A|/B.
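A sketch of block nested-loop over in-memory slices (the `inner_scans` counter is illustrative; a disk-based implementation would count inner-table I/O instead):

```rust
/// Block nested-loop join: process the outer input in blocks of `block_size`
/// rows, scanning the inner input once per block instead of once per row.
fn block_nested_loop_join(
    outer: &[(u32, &str)],
    inner: &[(u32, &str)],
    block_size: usize,
) -> Vec<(u32, String, String)> {
    let mut results = Vec::new();
    let mut inner_scans = 0;
    for block in outer.chunks(block_size) {
        inner_scans += 1; // one full inner scan per outer block
        for &(inner_id, inner_val) in inner {
            // In-memory probe of the current outer block.
            for &(outer_id, outer_val) in block {
                if outer_id == inner_id {
                    results.push((outer_id, outer_val.to_string(), inner_val.to_string()));
                }
            }
        }
    }
    eprintln!("inner scans: {inner_scans}");
    results
}

fn main() {
    let a = [(25544, "a1"), (43013, "a2"), (48274, "a3"), (20580, "a4")];
    let b = [(43013, "b1"), (20580, "b2")];
    // block_size = 2 → 2 inner scans instead of 4 with plain nested-loop.
    let joined = block_nested_loop_join(&a, &b, 2);
    assert_eq!(joined.len(), 2);
}
```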
Hash Join
Build a hash table on the smaller input (the build side), then probe it with the larger input (the probe side).
Build phase: Scan the build side and insert each row into a hash table keyed by the join column.
Probe phase: Scan the probe side. For each row, hash the join column and look up matching rows in the hash table.
Build: hash_table = {}
for each row_b in source_b:
hash_table[row_b.norad_id].append(row_b)
Probe:
for each row_a in source_a:
for each row_b in hash_table[row_a.norad_id]:
emit (row_a, row_b)
Cost: O(|A| + |B|) — one scan of each input. Hash table operations are O(1) amortized.
Memory: The hash table must fit in memory. Size ≈ |build_side| × (key_size + row_size + overhead). For 100k TLE records at ~100 bytes each: ~10MB. Trivially fits in memory.
When to use: Equi-joins (join on equality) where the build side fits in memory. This is the default join algorithm in most query engines for good reason — it's optimal for the vast majority of join workloads.
For the OOR: hash join merges two 100k-row sources in ~200k operations. Five sources require 4 sequential hash joins (or a multi-way hash join), all completing in under 100ms.
Sort-Merge Join
Sort both inputs on the join column, then merge them in a single pass (like the LSM merge iterator from Module 3).
Sort phase: Sort both inputs by the join key. O(|A| log |A| + |B| log |B|).
Merge phase: Advance two cursors through the sorted inputs, matching on the join key. O(|A| + |B|).
Sort source_a by norad_id
Sort source_b by norad_id
cursor_a = 0, cursor_b = 0
while cursor_a < |A| and cursor_b < |B|:
if a[cursor_a].norad_id == b[cursor_b].norad_id:
emit (a[cursor_a], b[cursor_b])
advance both cursors (handling duplicates)
elif a[cursor_a].norad_id < b[cursor_b].norad_id:
cursor_a += 1
else:
cursor_b += 1
Cost: O(|A| log |A| + |B| log |B|) for the sort phases, O(|A| + |B|) for the merge. Dominated by the sort.
When to use: When inputs are already sorted (e.g., from an index scan or a preceding sort operator), the sort phase is free and the total cost is O(|A| + |B|) — optimal. Also useful when the join result must be sorted (the output is already in join-key order). Handles non-memory-fitting inputs gracefully via external sort.
For the OOR: if TLE sources are pre-sorted by NORAD ID (which they often are, since NORAD IDs are sequential), sort-merge join is optimal — the sort phase costs nothing, and the merge is a single linear pass.
Cost Comparison
| Algorithm | Time | Memory | Pre-sorted Input |
|---|---|---|---|
| Nested-loop | O(A × B) | O(1) | No benefit |
| Hash join | O(A + B) | O(min(A,B)) | No benefit |
| Sort-merge | O(A log A + B log B) | O(A + B) for sort | O(A + B) if pre-sorted |
Multi-Way Join for 5 Sources
The OOR catalog merge joins 5 sources. Strategies:
Sequential pairwise: Join source 1 with 2, then result with 3, then with 4, then with 5. Four hash joins. Total cost: O(5 × N) where N is the source size. Simple and effective.
Multi-way sort-merge: Sort all 5 sources by NORAD ID, then merge all 5 simultaneously using a priority queue (exactly the merge iterator from Module 3). One pass through all data. Optimal if sources are pre-sorted.
For the OOR, the multi-way sort-merge is the better choice: TLE sources arrive pre-sorted by NORAD ID, and the merge iterator is already implemented.
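A minimal sketch of the k-way merge using a BinaryHeap as the priority queue; the "best TLE" selection among duplicate NORAD IDs is omitted, and `multi_way_merge` is an illustrative name:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Multi-way merge of pre-sorted TLE sources, keyed by NORAD ID.
/// A min-heap of (key, source_index, position) yields rows in global
/// key order — the same structure as the LSM merge iterator.
fn multi_way_merge(sources: &[Vec<(u32, &str)>]) -> Vec<(u32, String)> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first row of each non-empty source.
    for (src, rows) in sources.iter().enumerate() {
        if let Some(&(key, _)) = rows.first() {
            heap.push(Reverse((key, src, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((key, src, pos))) = heap.pop() {
        let (_, value) = sources[src][pos];
        merged.push((key, value.to_string()));
        // Refill the heap from the source we just consumed.
        if let Some(&(next_key, _)) = sources[src].get(pos + 1) {
            heap.push(Reverse((next_key, src, pos + 1)));
        }
    }
    merged
}

fn main() {
    let sources = vec![
        vec![(20580, "sds"), (25544, "sds")],
        vec![(25544, "esa"), (43013, "esa")],
        vec![(25544, "leolabs"), (48274, "leolabs")],
    ];
    let merged = multi_way_merge(&sources);
    let ids: Vec<u32> = merged.iter().map(|(id, _)| *id).collect();
    // Duplicate NORAD IDs emerge adjacent, ready for best-TLE resolution.
    assert_eq!(ids, vec![20580, 25544, 25544, 25544, 43013, 48274]);
}
```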
Code Examples
Hash Join Implementation
use std::collections::HashMap;
/// Hash join: match TLE records from two sources on NORAD ID.
fn hash_join(
build_side: &[TleRow], // Smaller source
probe_side: &[TleRow], // Larger source
) -> Vec<(TleRow, TleRow)> {
// Build phase: index the build side by NORAD ID
let mut hash_table: HashMap<u32, Vec<&TleRow>> = HashMap::new();
for row in build_side {
hash_table.entry(row.norad_id).or_default().push(row);
}
// Probe phase: look up each probe-side row in the hash table
let mut results = Vec::new();
for probe_row in probe_side {
if let Some(matches) = hash_table.get(&probe_row.norad_id) {
for &build_row in matches {
results.push((build_row.clone(), probe_row.clone()));
}
}
}
results
}
Sort-Merge Join for Pre-Sorted Sources
/// Sort-merge join on pre-sorted inputs. Both inputs must be sorted by norad_id.
fn sort_merge_join(
left: &[TleRow],
right: &[TleRow],
) -> Vec<(TleRow, TleRow)> {
let mut results = Vec::new();
let mut li = 0;
let mut ri = 0;
while li < left.len() && ri < right.len() {
match left[li].norad_id.cmp(&right[ri].norad_id) {
std::cmp::Ordering::Equal => {
// Collect all rows with this key from both sides
let key = left[li].norad_id;
let l_start = li;
while li < left.len() && left[li].norad_id == key { li += 1; }
let r_start = ri;
while ri < right.len() && right[ri].norad_id == key { ri += 1; }
// Cross product of matching rows (for equi-join)
for l in &left[l_start..li] {
for r in &right[r_start..ri] {
results.push((l.clone(), r.clone()));
}
}
}
std::cmp::Ordering::Less => li += 1,
std::cmp::Ordering::Greater => ri += 1,
}
}
results
}
The sort-merge join's merge phase is identical to the LSM merge iterator logic. For the OOR's unique NORAD IDs (no duplicates within a source), the cross-product in the equal case always produces exactly one match — the merge is linear.
Key Takeaways
- Nested-loop join is O(A × B) — only viable for very small inputs. Hash join is O(A + B) with O(min(A,B)) memory. Sort-merge join is O(A + B) if inputs are pre-sorted.
- Hash join is the default for equi-joins in most query engines. It requires the build side to fit in memory, which is almost always true for the OOR's workload sizes.
- Sort-merge join is optimal when inputs are pre-sorted (the sort phase is free). The LSM merge iterator from Module 3 is already a sort-merge join — the same algorithm applies here.
- The OOR catalog merge (5 pre-sorted sources × 100k objects) is best served by a multi-way sort-merge: one linear pass through all sources using a merge iterator with a priority queue.
- Join algorithm selection is a query optimization decision. The execution engine should support all three algorithms and choose based on input sizes, sort order, and available memory.
Project — Orbital Catalog Merge System
Module: Database Internals — M06: Query Processing
Track: Orbital Object Registry
Estimated effort: 6–8 hours
- SDA Incident Report — OOR-2026-0047
- Acceptance Criteria
- Starter Structure
- Test Data Generation
- Hints
- Track Complete
SDA Incident Report — OOR-2026-0047
Classification: ENGINEERING DIRECTIVE
Subject: Build a structured query engine for multi-source TLE catalog merging
Replace the ad-hoc nested-loop catalog merge with a composable query execution engine. The engine must support scan, filter, sort, and join operators, and merge TLE data from 5 sources within the conjunction pipeline's 10-second deadline.
Acceptance Criteria
- Volcano operators. Implement `SeqScan`, `Filter`, `Projection`, and `Sort` operators using the `Operator` trait. Compose them into a plan that scans 100,000 TLE records, filters by inclination > 80°, and projects to (norad_id, epoch).
- Hash join. Implement a hash join operator. Join two 100k-row sources on NORAD ID. Verify the output contains exactly the matching pairs.
- Sort-merge join. Implement a sort-merge join for pre-sorted inputs. Join two 100k-row sources (pre-sorted by NORAD ID). Verify the output matches the hash join result.
- Multi-way merge. Implement a 5-way merge join using a min-heap (reuse the merge iterator pattern from Module 3). Merge 5 sources of 100k records each, all sorted by NORAD ID. Verify the merged output is sorted and contains all matching records.
- Performance target. The 5-way merge of 500,000 total records must complete in under 2 seconds. Print the elapsed time. Compare against a naive nested-loop join on a subset (1,000 records per source) and report the speedup.
- Conflict resolution. When multiple sources provide TLEs for the same NORAD ID, select the TLE with the most recent epoch. Print the number of conflicts resolved and the winning source for 10 sample objects.
- Vectorized filter (bonus). Implement a vectorized filter that processes batches of 1024 rows in columnar format. Compare its throughput to the row-at-a-time volcano filter on 100,000 rows.
Starter Structure
catalog-merge/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point
│ ├── operator.rs # Operator trait, SeqScan, Filter, Projection, Sort
│ ├── hash_join.rs # HashJoinOperator
│ ├── sort_merge.rs # SortMergeJoinOperator
│ ├── merge_iter.rs # MultiWayMerge (reuse from Module 3)
│ ├── vectorized.rs # VecFilter (bonus)
│ └── tle.rs # TleRow, test data generation
Test Data Generation
Generate synthetic TLE data for 5 sources:
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

fn generate_source(source_name: &str, num_objects: usize) -> Vec<TleRow> {
    // Deterministic seed per source: hash the source name so each source
    // produces stable but distinct measurements (assumes the `rand` crate).
    let seed = source_name
        .bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64));
    let mut rng = StdRng::seed_from_u64(seed);
    (0..num_objects).map(|i| TleRow {
        norad_id: i as u32 + 1,                  // NORAD IDs 1..=100000
        epoch: 84.0 + rng.gen::<f64>() * 0.5,    // slight epoch variation per source
        inclination: rng.gen::<f64>() * 180.0,
        mean_motion: 14.0 + rng.gen::<f64>() * 2.0,
        source: source_name.to_string(),
    }).collect()
}
Each source provides TLEs for the same 100k objects but with slightly different epochs and measurements. The merge resolves conflicts by picking the most recent epoch per NORAD ID.
Hints
Hint 1 — Hash join operator as a volcano operator
The hash join operator's open() consumes the entire build side (calling build_child.next() until None, building the hash table). Then next() probes one row at a time from the probe side. The build phase is a pipeline breaker; the probe phase pipelines.
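One way this might look as code, assuming a pull-based `Operator` trait with an associated output type. The trait shape, `VecScan` helper, and simplified `(norad_id, epoch)` rows are all illustrative, not the starter code's actual API:

```rust
use std::collections::HashMap;

// Illustrative pull-based API; the associated-type shape is an assumption.
trait Operator {
    type Out;
    fn open(&mut self);
    fn next(&mut self) -> Option<Self::Out>;
}

// Simplified (norad_id, epoch) rows for the sketch.
type Row = (u32, f64);

// Trivial scan over an in-memory Vec, used as a child operator.
struct VecScan { rows: Vec<Row>, pos: usize }

impl Operator for VecScan {
    type Out = Row;
    fn open(&mut self) { self.pos = 0; }
    fn next(&mut self) -> Option<Row> {
        let row = self.rows.get(self.pos).copied();
        self.pos += 1;
        row
    }
}

/// Hash join in volcano style: open() drains the entire build child
/// (the pipeline breaker); next() pulls probe rows one at a time.
struct HashJoin<B, P> {
    build_child: B,
    probe_child: P,
    table: HashMap<u32, Vec<Row>>,
    pending: Vec<(Row, Row)>, // matches queued for the current probe row
}

impl<B, P> Operator for HashJoin<B, P>
where
    B: Operator<Out = Row>,
    P: Operator<Out = Row>,
{
    type Out = (Row, Row);

    fn open(&mut self) {
        self.build_child.open();
        self.probe_child.open();
        // Build phase: consume the whole build side before emitting anything.
        while let Some(row) = self.build_child.next() {
            self.table.entry(row.0).or_default().push(row);
        }
    }

    fn next(&mut self) -> Option<(Row, Row)> {
        loop {
            // Emit any matches already queued for the current probe row.
            if let Some(pair) = self.pending.pop() {
                return Some(pair);
            }
            // Probe phase: pull the next probe row and look it up.
            let probe = self.probe_child.next()?;
            if let Some(matches) = self.table.get(&probe.0) {
                for &b in matches {
                    self.pending.push((b, probe));
                }
            }
        }
    }
}
```

The `pending` buffer handles duplicate build-side keys: one probe row can match several build rows, but `next()` must still emit one pair per call.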
Hint 2 — Conflict resolution as a post-merge step
After the 5-way merge produces groups of TLEs with the same NORAD ID, apply a "group-by" operator that collects all rows with the same key and emits the winner (most recent epoch). This is a simple reduce over each group.
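Because the merged stream is sorted by NORAD ID, each group is contiguous and the reduce needs no hash table. A minimal sketch over `(norad_id, epoch, source)` tuples, with the function name chosen for illustration:

```rust
/// Pick the winning TLE per NORAD ID: most recent epoch wins.
/// Input is the merged, key-sorted stream, so each group is contiguous.
fn resolve_conflicts(merged: &[(u32, f64, &str)]) -> Vec<(u32, f64, String)> {
    let mut out: Vec<(u32, f64, String)> = Vec::new();
    for &(id, epoch, src) in merged {
        match out.last_mut() {
            // Same group as the previous row: keep the more recent epoch.
            Some(last) if last.0 == id => {
                if epoch > last.1 {
                    last.1 = epoch;
                    last.2 = src.to_string();
                }
            }
            // New group: this row is the provisional winner.
            _ => out.push((id, epoch, src.to_string())),
        }
    }
    out
}
```

Counting conflicts for the acceptance criterion is then just `merged.len() - out.len()`.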
Hint 3 — Performance measurement
Use std::time::Instant for timing. Measure the merge end-to-end (including any sort phases). For the nested-loop comparison, use a small subset (1,000 rows per source) to avoid waiting minutes.
Track Complete
Congratulations. You have built a storage engine from the ground up:
| Module | What You Built |
|---|---|
| M1: Storage Engine Fundamentals | Page layout, buffer pool, slotted pages |
| M2: B-Tree Index Structures | B+ tree with splits, merges, range scans |
| M3: LSM Trees & Compaction | Memtable, SSTables, leveled compaction, bloom filters |
| M4: WAL & Recovery | Write-ahead log, crash recovery, checkpointing |
| M5: Transactions & Isolation | MVCC snapshot isolation, write conflict detection, GC |
| M6: Query Processing | Volcano model, vectorized execution, hash/sort-merge joins |
The Orbital Object Registry is now a fully functional, crash-recoverable, transactional storage engine with indexed access and structured query execution. The ESA deadline has been met.