Every number, wins and losses.
cargo run -p ffs-evals --release runs every scenario
every time and prints both faster and slower paths. Numbers below are
from a single run on Apple M-series silicon; machine spec,
methodology, and the rationale for each comparator live in
ffs-evals/src/main.rs. Losses are published
in the same tables as the wins.
Headline results
| Comparator | Shape | Measured path | Result |
|---|---|---|---|
| sled | embedded KV | insert / lookup / 100-key range scan | ffs 2.89x / 1.55x / 3.94x |
| instant-distance | embedded HNSW | build / query, dim 128, k=10 | ffs 1.13x / 2.64x |
| petgraph | in-memory graph | 1-hop / 2-hop BFS / edge append | ffs 9.28x / 1.43x · append 9.64x slower |
| Postgres + pgvector | server vector DB | HNSW query, same algorithm and parameters | ffs 3.01x |
| LanceDB | embedded vector DB | query vs auto IVF-PQ index / vs brute force | ffs 3.03x / 4.78x |
| Kùzu | embedded graph DB | bulk load / 1-hop — native API vs Cypher, stack cost* | ffs 18.5x / ~29,000x* |
| Neo4j | server graph DB | bulk load / 1-hop — native API vs Bolt, stack cost* | ffs ~1,200x / ~195,000x* |
* The starred ratios measure architectural stack cost rather than engine parity. "The loud numbers" below explains exactly what they do and do not show.
vs sled
| Op | ffs::BTree | sled | Result |
|---|---|---|---|
| insert (random u64) | 429 ns/op | 1241 ns/op | ffs 2.89x |
| lookup (random u64) | 182 ns/op | 282 ns/op | ffs 1.55x |
| range scan (100-key) | 796 ns/op | 3134 ns/op | ffs 3.94x |
100,000 random inserts and lookups, 1,000 range scans, fresh DB per run. The FFS tree is backed by a 4096-page buffer pool; reads and writes operate directly on cached page bytes with no node materialization on the hot paths. The engine page walks through the four changes that took inserts from 4676 ns to 429 ns.
vs instant-distance
| Op | ffs::Hnsw | instant-distance | Result |
|---|---|---|---|
| build (per insert) | 202 µs/op | 228 µs/op | ffs 1.13x |
| query (per search) | 113 µs/op | 298 µs/op | ffs 2.64x |
| recall@10 | 0.966 | 0.969 | comparable |
10,000 random vectors at dim 128, M=16, k=10, recall measured against an exact brute-force oracle over 1,000 queries. The speedup is the epoch-stamped visited-scratch trick; recall stays even.
vs petgraph
| Op | ffs::Csr | petgraph | Result |
|---|---|---|---|
| append edge | 30 ns/op | 3 ns/op | petgraph 9.64x |
| 1-hop out-neighbours | 11 ns/op | 102 ns/op | ffs 9.28x |
| 2-hop BFS reachable set | 1137 ns/op | 1630 ns/op | ffs 1.43x |
50,000 vertices, 500,000 random directed edges, 1,000 traversal sources. The intended tradeoff is visible in the first row: FFS pays at append time for gap-preserving rel-list bookkeeping and collects on every read, because neighbors are a dense slice with cache locality an edge-list cannot match.
vs Postgres + pgvector
| Approach | Build path | Query latency |
|---|---|---|
| ffs::Hnsw | per-insert HNSW graph build | 126 µs/query |
| pgvector, HNSW index | 585 µs/insert + 5.75 s index build | 380 µs/query |
| pgvector, no index | 9.5 µs/insert | 1877 µs/query |
Postgres 17 + pgvector over the default Unix socket, same HNSW algorithm, same M=16 and ef_construction=200 on both sides. This is the cleanest apples-to-apples vector number in the matrix: the 3.01x gap is the SQL parse, libpq round-trip, and row deserialization between the application and the index — same algorithm, different stack shape.
vs LanceDB
| Approach | Build path | Query latency |
|---|---|---|
| ffs::Hnsw | per-insert HNSW graph build | 115 µs/query |
| LanceDB, auto IVF-PQ index | 17.6 µs/insert avg + 0.17 s index | 350 µs/query |
| LanceDB, brute scan | 523 ns/insert, no index | 550 µs/query |
10,000 vectors at dim 128, 200 queries at k=10. LanceDB's bulk load is much cheaper because it builds no index inline; FFS pays at insert time and queries are quick from the first insert. Which side of that tradeoff matters depends on the workload.
The loud numbers
| Op | ffs::Csr | Kùzu | Neo4j (Bolt) |
|---|---|---|---|
| bulk load (per edge) | 21–22 ns/op | 381 ns/op | 26,359 ns/op |
| 1-hop out-neighbours | 10 ns/op | 289,462 ns/op | 1,877,869 ns/op |
5,000 vertices, 25,000 edges, 200 traversal sources. These ratios are
loud and the README names what they actually measure. FFS here is a
native Rust call — csr.neighbors(src) is a slice
iterator. Kùzu runs each query through ANTLR parse, binder,
planner, executor, and FFI marshalling even with a prepared
statement; Neo4j adds a Bolt server round-trip on top. That stack is
the cost real applications pay on every query, and FFS's
collapsed-into-one-process design avoids it by construction. The
apples-to-apples follow-up — FFS Cypher over stored graph data vs
Kùzu Cypher — is named in the README as the next comparison to
add. Until then, read these rows as stack-overhead evidence and
remaining headroom, never as a sibling-database score.
Predicate-aware retrieval
| Selectivity | Strategy | Latency | Recall@10 |
|---|---|---|---|
| 10% | post-filter, 10x oversample | 125 µs/query | 85% |
| 10% | predicate-aware ANN | 618 µs/query | 100% |
| 1% | post-filter, 10x oversample | 126 µs/query | 11% |
| 1% | predicate-aware ANN | 2535 µs/query | 100% |
20,000 nodes at dim 128, graph predicate dept_id == 0,
top-10 nearest neighbours among matching nodes. The honest read: at
low selectivity, post-filtering isn't faster — it's wrong. 11% recall
means the user gets one of the ten results they asked for. Matching
predicate-aware recall would need ~100x oversampling, which closes
the latency gap anyway. The 2.5 ms query is the cost of correctness,
paid once, inside one engine — a three-system stack pays
serialization, network, and join-on-id on every call before it even
gets the answer wrong.
Coverage
| Comparator | Shape | Status |
|---|---|---|
| sled | embedded KV | measured — faster on all three ops |
| instant-distance | embedded HNSW | measured — faster on build and query |
| petgraph | in-memory graph | measured — faster on reads, slower on appends |
| Kùzu | embedded graph DB | measured — native API vs Cypher, stack cost |
| LanceDB | embedded vector DB | measured — query 3x faster than indexed path |
| Neo4j | server graph DB | measured — native API vs Bolt, stack cost |
| Postgres + pgvector | server vector DB | measured — 3.01x, same algorithm and parameters |
| Memgraph / FalkorDB | server graph | deferred — needs Docker |
| Postgres + AGE | server graph extension | deferred — needs a source build |
Where FFS is slower, the harness measures it. Where the comparator cannot run yet, the table says so. The aim is a comparator matrix for graph, vector, and typed storage paths rather than one hand-picked headline number.