Lesson 3 — Join Algorithms

Module: Database Internals — M06: Query Processing
Position: Lesson 3 of 3
Source: Database Internals — Alex Petrov, Chapter 15; Designing Data-Intensive Applications — Martin Kleppmann, Chapter 6

Source note: This lesson was synthesized from training knowledge. Verify Petrov's join algorithm cost models and Kleppmann's distributed join discussion against the source chapters.



Context

The OOR's catalog merge problem is fundamentally a join: match TLE records from 5 sources on NORAD catalog ID, then select the best TLE for each object (most recent epoch, highest source priority). In SQL terms: SELECT * FROM source_a a JOIN source_b b ON a.norad_id = b.norad_id.

The choice of join algorithm determines whether this merge takes 45 seconds (nested-loop) or under 1 second (hash join). This lesson covers the three fundamental join algorithms, their cost models, and when each is optimal.


Core Concepts

Nested-Loop Join

The simplest join: for each row in the outer table, scan the entire inner table for matches.

for each row_a in source_a:          # |A| iterations
    for each row_b in source_b:      # |B| iterations per outer row
        if row_a.norad_id == row_b.norad_id:
            emit (row_a, row_b)

Cost: O(|A| × |B|) comparisons. For two 100k-row sources: 10 billion comparisons. Completely impractical for the OOR catalog merge.

When to use: Only when one side is very small (< 100 rows) or when no better algorithm is available (no index, insufficient memory for a hash table). Also useful for non-equi joins (e.g., a.epoch > b.epoch) where hash join doesn't apply.

Block nested-loop improves on this by loading a block of B outer rows into memory and scanning the inner table once per block, cutting the number of inner-table scans from |A| (one per outer row) to ⌈|A| / B⌉ (one per block).
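The block variant can be sketched as follows. This is an illustrative sketch, not the OOR's actual code: `Row` is a hypothetical minimal record type, and `block_nested_loop_join` is an assumed name.

```rust
/// Hypothetical minimal record type for illustration.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    norad_id: u32,
}

/// Block nested-loop join sketch: load a block of the outer input,
/// then scan the inner input once per block instead of once per row.
fn block_nested_loop_join(outer: &[Row], inner: &[Row], block_size: usize) -> Vec<(Row, Row)> {
    let mut results = Vec::new();
    for block in outer.chunks(block_size) {
        // One inner-table scan per block of outer rows.
        for inner_row in inner {
            for outer_row in block {
                if outer_row.norad_id == inner_row.norad_id {
                    results.push((outer_row.clone(), inner_row.clone()));
                }
            }
        }
    }
    results
}
```

The comparison count is unchanged from plain nested-loop; the win is I/O, since the inner table is re-read |A|/B times instead of |A| times.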

Hash Join

Build a hash table on the smaller input (the build side), then probe it with the larger input (the probe side).

Build phase: Scan the build side and insert each row into a hash table keyed by the join column.

Probe phase: Scan the probe side. For each row, hash the join column and look up matching rows in the hash table.

Build: hash_table = {}
for each row_b in source_b:
    hash_table.setdefault(row_b.norad_id, []).append(row_b)

Probe:
for each row_a in source_a:
    for each row_b in hash_table.get(row_a.norad_id, []):   # empty list if no match
        emit (row_a, row_b)

Cost: O(|A| + |B|) — one scan of each input. Hash table operations are O(1) amortized.

Memory: The hash table must fit in memory. Size ≈ |build_side| × (key_size + row_size + overhead). For 100k TLE records at ~100 bytes each: ~10MB. Trivially fits in memory.

When to use: Equi-joins (join on equality) where the build side fits in memory. This is the default join algorithm in most query engines for good reason — it's optimal for the vast majority of join workloads.

For the OOR: hash join merges two 100k-row sources in ~200k operations. Five sources require 4 sequential hash joins (or a multi-way hash join), all completing in under 100ms.

Sort-Merge Join

Sort both inputs on the join column, then merge them in a single pass (like the LSM merge iterator from Module 3).

Sort phase: Sort both inputs by the join key. O(|A| log |A| + |B| log |B|).

Merge phase: Advance two cursors through the sorted inputs, matching on the join key. O(|A| + |B|).

Sort source_a by norad_id
Sort source_b by norad_id

cursor_a = 0, cursor_b = 0
while cursor_a < |A| and cursor_b < |B|:
    if a[cursor_a].norad_id == b[cursor_b].norad_id:
        emit (a[cursor_a], b[cursor_b])
        advance both cursors (handling duplicates)
    elif a[cursor_a].norad_id < b[cursor_b].norad_id:
        cursor_a += 1
    else:
        cursor_b += 1

Cost: O(|A| log |A| + |B| log |B|) for the sort phases, O(|A| + |B|) for the merge. Dominated by the sort.

When to use: When inputs are already sorted (e.g., from an index scan or a preceding sort operator), the sort phase is free and the total cost is O(|A| + |B|) — optimal. Also useful when the join result must be sorted (the output is already in join-key order). Handles non-memory-fitting inputs gracefully via external sort.

For the OOR: if TLE sources are pre-sorted by NORAD ID (which they often are, since NORAD IDs are sequential), sort-merge join is optimal — the sort phase costs nothing, and the merge is a single linear pass.

Cost Comparison

Algorithm    | Time                  | Memory             | Pre-sorted input
Nested-loop  | O(A × B)              | O(1)               | No benefit
Hash join    | O(A + B)              | O(min(A, B))       | No benefit
Sort-merge   | O(A log A + B log B)  | O(A + B) for sort  | O(A + B) if pre-sorted

Multi-Way Join for 5 Sources

The OOR catalog merge joins 5 sources. Strategies:

Sequential pairwise: Join source 1 with 2, then the result with 3, then 4, then 5 — four hash joins in total. Each join is linear in its inputs, so the total cost is O(k × N) for k sources of size N. Simple and effective.
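The pairwise strategy can be sketched with rows reduced to bare NORAD IDs (`hash_join_ids` and `merge_sources` are illustrative names, not OOR code; each pairwise step here is an inner equi-join that keeps only the IDs present in both inputs):

```rust
use std::collections::HashSet;

/// One pairwise hash join on bare keys: build a hash set from `a`,
/// probe with `b`, keep the IDs present in both.
fn hash_join_ids(a: &[u32], b: &[u32]) -> Vec<u32> {
    let build: HashSet<u32> = a.iter().copied().collect();
    b.iter().copied().filter(|id| build.contains(id)).collect()
}

/// Fold k sources together with repeated pairwise hash joins.
fn merge_sources(sources: &[Vec<u32>]) -> Vec<u32> {
    if sources.is_empty() {
        return Vec::new();
    }
    let mut acc = sources[0].clone();
    for src in &sources[1..] {
        acc = hash_join_ids(&acc, src); // one hash join per remaining source
    }
    acc
}
```

Note the inner-join semantics: only IDs present in every source survive the fold. A catalog merge that must keep every tracked object, including ones seen by a single source, would need outer-join semantics instead.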

Multi-way sort-merge: Sort all 5 sources by NORAD ID, then merge all 5 simultaneously using a priority queue (exactly the merge iterator from Module 3). One pass through all data. Optimal if sources are pre-sorted.

For the OOR, the multi-way sort-merge is the better choice: TLE sources arrive pre-sorted by NORAD ID, and the merge iterator is already implemented.
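The multi-way merge can be sketched with a priority queue over the k source cursors. This is a sketch under stated assumptions: sources are pre-sorted by NORAD ID, rows are simplified to (norad_id, payload) pairs, and source order encodes priority (lower index wins ties), so the first record popped for a given ID is the one kept.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge over pre-sorted sources of (norad_id, payload) pairs.
/// For each NORAD ID, keeps the record from the highest-priority source
/// (lowest source index). Illustrative types, not the real TLE schema.
fn multi_way_merge(sources: &[Vec<(u32, &'static str)>]) -> Vec<(u32, &'static str)> {
    // Min-heap of (key, source index, cursor position): ties on key
    // pop in source-index order, so higher-priority sources win.
    let mut heap = BinaryHeap::new();
    for (si, src) in sources.iter().enumerate() {
        if let Some(&(key, _)) = src.first() {
            heap.push(Reverse((key, si, 0usize)));
        }
    }

    let mut out: Vec<(u32, &'static str)> = Vec::new();
    while let Some(Reverse((key, si, pos))) = heap.pop() {
        let (_, payload) = sources[si][pos];
        // First pop for a key wins; later pops of the same key are skipped.
        if out.last().map(|&(k, _)| k) != Some(key) {
            out.push((key, payload));
        }
        // Advance this source's cursor.
        if pos + 1 < sources[si].len() {
            heap.push(Reverse((sources[si][pos + 1].0, si, pos + 1)));
        }
    }
    out
}
```

With k sources of N rows each, this is one O(k × N log k) pass, and the output comes out already sorted by NORAD ID — the same shape as the LSM merge iterator.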


Code Examples

Hash Join Implementation

use std::collections::HashMap;

/// Minimal TLE row assumed for these examples; a real row would carry
/// the full element set (epoch, line data, source priority, ...).
#[derive(Clone, Debug, PartialEq)]
struct TleRow {
    norad_id: u32,
}

/// Hash join: match TLE records from two sources on NORAD ID.
fn hash_join(
    build_side: &[TleRow],   // Smaller source: becomes the hash table
    probe_side: &[TleRow],   // Larger source: streamed through
) -> Vec<(TleRow, TleRow)> {
    // Build phase: index the build side by NORAD ID.
    // A Vec per key handles duplicate NORAD IDs on the build side.
    let mut hash_table: HashMap<u32, Vec<&TleRow>> = HashMap::new();
    for row in build_side {
        hash_table.entry(row.norad_id).or_default().push(row);
    }

    // Probe phase: look up each probe-side row in the hash table
    let mut results = Vec::new();
    for probe_row in probe_side {
        if let Some(matches) = hash_table.get(&probe_row.norad_id) {
            for &build_row in matches {
                results.push((build_row.clone(), probe_row.clone()));
            }
        }
    }
    results
}

Sort-Merge Join for Pre-Sorted Sources

/// Sort-merge join on pre-sorted inputs. Both inputs must be sorted by norad_id.
fn sort_merge_join(
    left: &[TleRow],
    right: &[TleRow],
) -> Vec<(TleRow, TleRow)> {
    let mut results = Vec::new();
    let mut li = 0;
    let mut ri = 0;

    while li < left.len() && ri < right.len() {
        match left[li].norad_id.cmp(&right[ri].norad_id) {
            std::cmp::Ordering::Equal => {
                // Collect all rows with this key from both sides
                let key = left[li].norad_id;
                let l_start = li;
                while li < left.len() && left[li].norad_id == key { li += 1; }
                let r_start = ri;
                while ri < right.len() && right[ri].norad_id == key { ri += 1; }

                // Cross product of matching rows (for equi-join)
                for l in &left[l_start..li] {
                    for r in &right[r_start..ri] {
                        results.push((l.clone(), r.clone()));
                    }
                }
            }
            std::cmp::Ordering::Less => li += 1,
            std::cmp::Ordering::Greater => ri += 1,
        }
    }
    results
}

The sort-merge join's merge phase is identical to the LSM merge iterator logic. For the OOR's unique NORAD IDs (no duplicates within a source), the cross-product in the equal case always produces exactly one match — the merge is linear.


Key Takeaways

  • Nested-loop join is O(A × B) — only viable for very small inputs. Hash join is O(A + B) with O(min(A,B)) memory. Sort-merge join is O(A + B) if inputs are pre-sorted.
  • Hash join is the default for equi-joins in most query engines. It requires the build side to fit in memory, which is almost always true for the OOR's workload sizes.
  • Sort-merge join is optimal when inputs are pre-sorted (the sort phase is free). The LSM merge iterator from Module 3 is already a sort-merge join — the same algorithm applies here.
  • The OOR catalog merge (5 pre-sorted sources × 100k objects) is best served by a multi-way sort-merge: one linear pass through all sources using a merge iterator with a priority queue.
  • Join algorithm selection is a query optimization decision. The execution engine should support all three algorithms and choose based on input sizes, sort order, and available memory.