[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A query plan has the structure: Project → Filter → Sort → SeqScan. The SeqScan produces 100,000 TLE rows. How many rows does the Sort operator need to buffer before it can emit its first output row?
"""
prompt.distractors = [
  "0 — Sort is a pipelining operator that emits rows as they arrive.",
  "1 — Sort only needs to buffer the current minimum row.",
  "50,000 — Sort buffers half the input and emits the sorted half.",
]
answer.answer = "All 100,000. Sort is a pipeline breaker — it must consume every input row before it can determine which row is first in the sorted order. Only after reading all 100,000 rows can it begin emitting output."
context = """
Sort is the canonical pipeline breaker. It cannot emit the smallest row until it has seen all rows — any subsequent input row could be smaller than the current minimum. The entire dataset must be materialized in memory (or spilled to disk for large datasets). This means the Sort operator's memory consumption is O(N), and the operators above it (Filter, Project) see no output until all 100,000 rows have been consumed. Pipeline breakers are the main source of latency and memory pressure in query execution plans.
"""
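The blocking behavior above can be sketched as a volcano-style operator. This is an illustrative sketch under an assumed simplified `Operator` trait, not the engine's actual API — the `SeqScan` and `Sort` types here are stand-ins:

```rust
// Illustrative sketch (assumed trait, not the engine's real API): Sort drains
// its ENTIRE child before it can emit the first row.
trait Operator {
    fn next(&mut self) -> Option<i64>;
}

struct SeqScan {
    rows: Vec<i64>,
    pos: usize,
}

impl Operator for SeqScan {
    fn next(&mut self) -> Option<i64> {
        let row = self.rows.get(self.pos).copied();
        self.pos += 1;
        row
    }
}

struct Sort {
    child: Box<dyn Operator>,
    buffer: Vec<i64>,
    pos: usize,
    loaded: bool,
}

impl Operator for Sort {
    fn next(&mut self) -> Option<i64> {
        if !self.loaded {
            // Pipeline breaker: buffer every child row before emitting any.
            while let Some(row) = self.child.next() {
                self.buffer.push(row);
            }
            self.buffer.sort();
            self.loaded = true;
        }
        let row = self.buffer.get(self.pos).copied();
        self.pos += 1;
        row
    }
}

fn sorted_output(rows: Vec<i64>) -> Vec<i64> {
    let scan = SeqScan { rows, pos: 0 };
    let mut sort = Sort { child: Box::new(scan), buffer: Vec::new(), pos: 0, loaded: false };
    let mut out = Vec::new();
    while let Some(row) = sort.next() {
        out.push(row);
    }
    out
}

fn main() {
    // The first next() on Sort cannot return until the scan is exhausted.
    println!("{:?}", sorted_output(vec![30, 10, 20])); // [10, 20, 30]
}
```

The `loaded` flag is the key design point: the whole input is materialized in `buffer` on the first `next()` call, which is exactly why Sort's memory footprint is O(N).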

[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A Filter operator with selectivity 1% sits above a SeqScan over 100,000 rows. In the volcano model, how many times does the Filter's next() call the SeqScan's next()?
"""
prompt.distractors = [
  "1,000 — once per matching row.",
  "1 — the Filter batches its requests to the SeqScan.",
  "It depends on whether the query has a LIMIT clause.",
]
answer.answer = "100,000 — the Filter calls SeqScan.next() for every row and discards the 99% that don't match. The Filter examines all rows to find the 1% that pass the predicate."
context = """
In the volcano model, the Filter operator has no way to skip rows — it must pull every row from its child and evaluate the predicate. Even with 1% selectivity, all 100,000 rows are produced by the SeqScan and examined by the Filter. This is one of the model's inefficiencies: a filter cannot 'push down' a predicate into the scan to skip irrelevant rows (though index scans achieve this by using the B+ tree to jump directly to matching keys). A LIMIT clause on the root operator would allow early termination once enough matching rows are found, but the Filter itself doesn't know about LIMIT — it just responds to next() calls.
"""
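The pull-everything behavior can be demonstrated by instrumenting the scan. A minimal sketch, assuming a simplified trait (the `Operator`, `SeqScan`, and `Filter` names are illustrative, not the engine's real types):

```rust
// Sketch: a volcano Filter has no way to skip child rows -- it pulls every
// one, tests it, and discards non-matches.
trait Operator {
    fn next(&mut self) -> Option<i64>;
}

struct SeqScan {
    rows: Vec<i64>,
    pos: usize,
    calls: usize, // how many times the Filter pulled from us
}

impl Operator for SeqScan {
    fn next(&mut self) -> Option<i64> {
        self.calls += 1;
        let row = self.rows.get(self.pos).copied();
        self.pos += 1;
        row
    }
}

struct Filter {
    child: SeqScan,
    threshold: i64,
}

impl Operator for Filter {
    fn next(&mut self) -> Option<i64> {
        // Keep pulling until a row passes the predicate or the child is exhausted.
        loop {
            match self.child.next() {
                Some(row) if row > self.threshold => return Some(row),
                Some(_) => continue, // discarded, but still cost one child call
                None => return None,
            }
        }
    }
}

/// Drain the filter; return (rows emitted, child next() calls).
fn run_filter(rows: Vec<i64>, threshold: i64) -> (usize, usize) {
    let mut filter = Filter { child: SeqScan { rows, pos: 0, calls: 0 }, threshold };
    let mut emitted = 0;
    while filter.next().is_some() {
        emitted += 1;
    }
    (emitted, filter.child.calls)
}

fn main() {
    // 100 rows, only one passes (99 > 98): the scan is still pulled
    // 101 times (100 rows + 1 end-of-stream probe).
    let (emitted, calls) = run_filter((0..100).collect(), 98);
    println!("emitted={emitted} child_calls={calls}"); // emitted=1 child_calls=101
}
```

Note the extra end-of-stream call: drained exactly, the child is pulled N + 1 times, which is why "100,000" is the right order-of-magnitude answer regardless of selectivity.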

[[questions]]
type = "Tracing"
prompt.program = """
fn main() {
    // Simulate volcano model execution
    let data = vec![10, 25, 30, 45, 50];
    let mut cursor = 0;
    let mut output = Vec::new();

    // Filter: value > 20, then Project: value * 2
    loop {
        // SeqScan.next()
        if cursor >= data.len() { break; }
        let row = data[cursor];
        cursor += 1;

        // Filter
        if row <= 20 { continue; }

        // Project
        let projected = row * 2;
        output.push(projected);
    }

    println!("{:?}", output);

}
"""
answer.doesCompile = true
answer.stdout = "[50, 60, 90, 100]"
context = """
SeqScan produces: 10, 25, 30, 45, 50. Filter (> 20) passes: 25, 30, 45, 50 (rejects 10). Project (* 2) transforms: 50, 60, 90, 100. The volcano model processes one row at a time through the pipeline. Row 10 is scanned, fails the filter, and is discarded. Row 25 is scanned, passes the filter, and is projected to 50. And so on. The output is [50, 60, 90, 100].
"""

[[questions]]
type = "MultipleChoice"
prompt.prompt = """
The OOR query engine uses Box<dyn Operator> for composable operator trees. An engineer profiles a CPU-bound catalog merge and finds 40% of CPU time is spent in virtual dispatch overhead from next() calls. What is the most effective optimization?
"""
prompt.distractors = [
  "Replace trait objects with enum dispatch — this eliminates virtual dispatch.",
  "Use unsafe to skip the dynamic dispatch entirely.",
  "Add a prefetch() method to the Operator trait to warm the CPU cache.",
]
answer.answer = "Switch to vectorized execution (Lesson 2) — process batches of rows per next() call instead of single rows. This amortizes the dispatch overhead across hundreds of rows per call."
context = """
Enum dispatch eliminates vtable indirection but doesn't address the fundamental problem: calling next() 100,000 times with one row each. Vectorized execution calls next() ~200 times with 500 rows each — the same total rows but 500x fewer function calls. The per-call overhead (virtual dispatch, function entry/exit, branch prediction) is amortized across the batch. This is why DuckDB, DataFusion, and Velox all use vectorized execution for analytical queries. Unsafe code is never the answer to algorithmic overhead.
"""
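The batching arithmetic in the context (~200 calls of 500 rows each) can be checked with a small sketch. This assumes a hypothetical `next_batch()` interface, not the engine's real API:

```rust
// Illustrative sketch (assumed API): a vectorized scan returns a batch of rows
// per next_batch() call, so per-call overhead is paid once per batch, not per row.
const BATCH_SIZE: usize = 500;

struct VecScan {
    rows: Vec<i64>,
    pos: usize,
    calls: usize, // how many times next_batch() was invoked
}

impl VecScan {
    fn next_batch(&mut self) -> Option<Vec<i64>> {
        self.calls += 1;
        if self.pos >= self.rows.len() {
            return None;
        }
        let end = (self.pos + BATCH_SIZE).min(self.rows.len());
        let batch = self.rows[self.pos..end].to_vec();
        self.pos = end;
        Some(batch)
    }
}

/// Drain a scan over n_rows; return (rows seen, next_batch() calls made).
fn drain(n_rows: usize) -> (usize, usize) {
    let mut scan = VecScan { rows: vec![1; n_rows], pos: 0, calls: 0 };
    let mut seen = 0;
    while let Some(batch) = scan.next_batch() {
        seen += batch.len();
    }
    (seen, scan.calls)
}

fn main() {
    // 100,000 rows move through 200 batches; the 201st call reports end-of-stream.
    // Compare ~100,001 calls for a row-at-a-time volcano scan over the same data.
    let (seen, calls) = drain(100_000);
    println!("rows={seen} calls={calls}"); // rows=100000 calls=201
}
```

A production engine would reuse a preallocated batch buffer rather than allocate a fresh `Vec` per call, but the call-count arithmetic is the same.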

[[questions]]
type = "MultipleChoice"
prompt.prompt = """
A query plan with 5 chained operators (Scan → Filter → Filter → Project → Project) processes 100,000 rows. Under the volcano model, approximately how many function calls are made in total for next() across all operators?
"""
prompt.distractors = [
  "100,000 — one call at the root pulls the entire chain.",
  "5 — one next() call per operator.",
  "200,000 — the Scan produces 100,000 and the first Filter doubles them.",
]
answer.answer = "Up to 500,000 — each of the 5 operators calls next() up to 100,000 times (once per input row). The root calls the second Project 100,000 times, which calls Filter2, which calls Filter1, which calls Scan — 5 × 100,000."
context = """
In the worst case (all rows pass all filters), every operator's next() is called once per input row. The root operator calls its child 100,000 times, which calls its child 100,000 times, and so on. Total: 5 × 100,000 = 500,000 next() calls. If the filters are selective (e.g., first filter passes 10%), the upper operators are called fewer times (10,000 for everything above the first filter), reducing total calls. But the scan and first filter always process all 100,000 rows. This per-row overhead is why vectorized execution (processing batches) is critical for CPU-bound workloads.
"""
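The 5 × N call count can be verified by instrumenting a chain of operators. A minimal sketch under an assumed `Operator` trait (the `Scan` and `PassThrough` types are hypothetical stand-ins for the plan's operators):

```rust
// Illustrative sketch (assumed trait): build a chain of operators, count every
// next() call. With no filtering, each operator is called once per row plus
// once for the end-of-stream probe.
use std::cell::Cell;
use std::rc::Rc;

trait Operator {
    fn next(&mut self) -> Option<i64>;
}

struct Scan {
    n: i64,
    i: i64,
    calls: Rc<Cell<u64>>,
}

impl Operator for Scan {
    fn next(&mut self) -> Option<i64> {
        self.calls.set(self.calls.get() + 1);
        if self.i < self.n {
            self.i += 1;
            Some(self.i)
        } else {
            None
        }
    }
}

/// Stands in for a Filter/Project that passes every row through (worst case).
struct PassThrough {
    child: Box<dyn Operator>,
    calls: Rc<Cell<u64>>,
}

impl Operator for PassThrough {
    fn next(&mut self) -> Option<i64> {
        self.calls.set(self.calls.get() + 1);
        self.child.next()
    }
}

/// Build a Scan plus (depth - 1) pass-through operators, drain the root,
/// and return the total number of next() calls across all operators.
fn total_calls(n: i64, depth: usize) -> u64 {
    let counters: Vec<Rc<Cell<u64>>> = (0..depth).map(|_| Rc::new(Cell::new(0))).collect();
    let mut op: Box<dyn Operator> = Box::new(Scan { n, i: 0, calls: counters[0].clone() });
    for c in &counters[1..] {
        op = Box::new(PassThrough { child: op, calls: c.clone() });
    }
    while op.next().is_some() {}
    counters.iter().map(|c| c.get()).sum()
}

fn main() {
    // 5 operators x (1,000 rows + 1 end-of-stream probe) = 5,005 calls.
    println!("{}", total_calls(1_000, 5)); // 5005
}
```

Scaled to 100,000 rows, the same chain makes 5 × 100,001 ≈ 500,000 calls, matching the answer's approximation.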