Inside the engine.
Typed records, relationship tables, vector indexes, provenance, MVCC, WAL/qlog recovery, and daemon commands on one storage layer behind one transaction boundary. Every layer below maps to an architecture decision record in the repo.
Schematic
┌──────────────────────────────────────────────────────────────────┐
│ query surface typed Cypher subset (parser + planner) │
├──────────────────────────────────────────────────────────────────┤
│ execution factorized vector groups · morsels │
│ NodeScan · Filter · Project · Limit │
│ ExtendOut · ExtendIn · Count │
├──────────────────────────────────────────────────────────────────┤
│ index B+-tree (primary, secondary) │
│ HNSW vector index │
├──────────────────────────────────────────────────────────────────┤
│ storage node groups · column chunks · CSR rel │
│ MVCC · WAL · pager · buffer pool │
└──────────────────────────────────────────────────────────────────┘
Each layer maps to one or more architecture decision records in
DECISIONS/, numbered in dependency order; ADR 0001 is
the node-group storage layout and the place to start reading.
Layer ledger
| Layer | In tree today | State |
|---|---|---|
| Pager + buffer pool | Single file, 8 KiB CRC-backed pages, full-file integrity scan, deterministic eviction with hit/miss/write-back counters. | shipped |
| Node groups + column chunks | 131K tuples per group; null, int, float, dictionary, dense-F32 embedding, and var-bytes chunk codecs with nullable variants. | shipped |
| Relationship tables | A+ double CSR, forward and backward adjacency persisted, gap-preserving rel-lists with 1.1x growth. | shipped |
| MVCC + WAL | Snapshot isolation, version chains, WAL encode/decode/replay, recovery tested under torn and corrupt tails. | shipped |
| Schema catalog | Typed catalog persisted through pager chunks; secondary-index roots and HNSW handles survive close/reopen via the superblock. | shipped |
| Schema migrations | ALTER ... ADD/RENAME/DROP PROPERTY with physical backfill chunks for scalar, embedding, Array, and Struct defaults; complex shapes remain. | partial |
| B+-tree indexes | Primary fixed-key and variable-key trees persisted; descent and in-place leaf insert run directly on cached page bytes. | shipped |
| Secondary indexes | On-disk trees with catalog manifests, cost-aware candidate selection across equality/range/string/null predicates; full composite plans pending. | partial |
| HNSW vector index | Persistent (Navix pattern): vector chunks, tombstone/upsert maintenance, compaction, delayed reclaimed-page reuse. | shipped |
| Execution operators | NodeScan, Filter, Project, Limit, Count, ExtendOut/In, IMJ, HashJoin, OrderBy, Distinct, Union; broader morsel threading remains. | partial |
| Planner | Filter-pushdown rewrites, label-cardinality-aware index costs, DP join ordering with observed-cardinality hooks; broader WCO/adaptive work pending. | partial |
| Cypher runtime | Parser, evaluator, executor: OPTIONAL MATCH, WITH, UNWIND, parameters, aggregates, real MERGE, DETACH DELETE, bounded variable-length paths. | shipped |
| Query-log durability | Every mutation appends a CRC32-framed qlog record; replay reproduces the durable prefix and tolerates torn tails; compaction and export/import. | shipped |
| Multi-source BFS | Then 2014 MS-BFS with u64 seed bitmaps, scale-tested to 1M nodes / 1.55M edges. | shipped |
ffsd daemon | Single-owner lock, startup integrity scan, 40+ line-oriented commands across query, ingest, validation, snapshots, and metrics; richer clients ahead. | partial |
| Python binding | pyo3 abi3 wheels for Linux/macOS, async driver, temporal and property-bag marshalling tested through the binding core. | shipped |
| Simulation | 11 deterministic crash/recovery scenarios: WAL torn tails, power-loss prefixes, randomized mutation schedules with minimized failing prefixes. | shipped |
| v1 spine (flag-gated) | On-disk read path: node, CSR, and edge shadows persisted to sibling pagers, snapshot visibility from disk, dirty-tail overlay, read-through page cache taking point reads from ~448 to ~1.6M ops/sec. Default-off; the write side still dual-writes. | partial |
How the B+-tree got fast
The insert path started as a sorted-Vec placeholder losing to sled by 8.8x on lookups. Four changes, each measured before the next:
Real pages. Replaced the in-memory sorted Vec with 8 KiB on-disk pages, splits propagating up to a fresh root, leaf sibling pointers (ADR 0008). Baseline after: 4676 ns insert, 3417 ns lookup.
Buffer pool. Every descent reads a memory-resident page slot instead of issuing seek-plus-read per touch (ADR 0002). Lookups dropped to 999 ns.
Decode-less descent. Binary-search directly on the cached page bytes without materializing a key vector. Lookups dropped to 136 ns, ahead of sled by 2x.
In-place leaf insert. memmove directly
on the cached page; decode-and-split only when the leaf actually
overflows, under 1% of inserts on this workload. Inserts dropped from
1490 ns to 429 ns. A non-overflowing insert touches exactly one leaf
page and makes zero heap allocations.
Speed mechanics
HNSW scratch. Vector search uses epoch-stamped
thread-local visited scratch instead of allocating a
HashSet per search layer: one allocation up front, O(1)
check-and-set per node visit, no hashing. The same change speeds up
build, since insert calls search_layer during
construction.
CSR traversal. Relationship tables keep gap-preserving CSR rel-lists with 1.1x growth. Appends pay for the bookkeeping (30 ns vs petgraph's 3 ns) and 1-hop reads collect it back: 11 ns vs 102 ns, because neighbors are a dense slice.
Multi-source BFS. cypher::query::bfs
implements Then 2014 MS-BFS with u64 seed bitmaps. 15 seeds at depth
4 on a 35K-node graph: 14 ms, against 68.8 s for the per-seed/per-hop
driver loop. 129 ms on a 1M-node / 1.55M-edge scale run.
Planner cost. The planner keeps materialized B+-tree
and HNSW probe work visible through LIMIT, charges
label-cardinality-aware descent, narrows projections before adjacency
expansion, and carries observed-cardinality hooks for adaptive join
ordering.
Morsel parallelism. Filter selection, Limit truncation, Count reduction, HashJoin build/probe/output, Distinct and OrderBy key extraction, and IMJ row probing can all split large chunks across morsels; broader threading through the executor is in flight.
Write-back durability. Native mutation batches, Cypher writes, vector tombstones/upserts, secondary-index updates, feed checkpoints, and provenance write-back all ride the same qlog/WAL machinery, so durability is one discipline instead of per-store reconciliation.
Operator surface
| Command family | What exists in tree |
|---|---|
QUERY / QUERY_PARAMS | Line-oriented Cypher execution, parameter maps, EXPLAIN/PROFILE paths. |
APPLY_MUTATIONS / LOAD_MUTATIONS | Native durable batch ingest with syntax-and-semantics preflight and qlog receipts. |
LOAD_NODES / LOAD_EDGES | Whole-file-preflighted CSV loads, including append-only and checkpointed-resume variants. |
LOAD_MANIFEST | Ordered native load plans: CSV batches, DDL, assertions, verify gates, metrics, snapshots, export/import, compaction steps. |
VERIFY / METRICS / QLOG_STATUS | Storage and qlog integrity scans, latency buckets with exact rolling percentiles, error counters, replay posture. |
SNAPSHOT / EXPORT / IMPORT | Pager+qlog snapshots and logical dumps, with live-path and collision preflight before anything is copied or truncated. |
Why Rust
No GC pause budget. A storage engine touches pages, indexes, buffers, WAL records, and vector graph structures on hot paths. The goals doc names memory safety without GC pauses as a reason for the Rust core.
Page-byte control. The B+-tree speedup above came entirely from operating on cached page bytes: buffer-pool integration, decode-less descent, in-place leaf writes.
Scratch discipline. The HNSW query speedup came from
replacing per-call visited HashSet tracking with an
epoch-stamped scratch vector that lives across calls.
Unsafe policy. The workspace denies unsafe operations in unsafe functions, warns on undocumented unsafe blocks, and the goals doc requires a safety proof for every unsafe block.
Rust example
use ffs::cypher::{compile_query, parse};
use ffs::exec::collect_column;
use ffs::planner::{compile, CompileCtx};
use ffs::storage::RelTable;
// A tiny social graph: 0 → 1, 0 → 2, 1 → 2, 2 → 3.
let mut knows = RelTable::new(0, "KNOWS", 4, 4);
knows.add_edge(0, 0, 1, 1);
knows.add_edge(0, 0, 2, 2);
knows.add_edge(1, 1, 2, 2);
knows.add_edge(2, 2, 3, 3);
// One-hop traversal, end-to-end through the engine.
let q = parse("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN b").unwrap();
let plan = compile_query(&q, |_| vec![0, 1, 2, 3]);
let ctx = CompileCtx::empty().with_rel_table("KNOWS", &knows);
let mut op = compile(plan, &ctx).unwrap();
let mut neighbours = collect_column(&mut *op, 0);
neighbours.sort();
assert_eq!(neighbours, vec![1, 2, 2, 3]);
The bigger demo — graph, vector, and provenance in one transaction
with a mocked LLM — lives at
ffs-demo/examples/write_back_loop.rs in the repo.