On B+ Trees vs LSM Trees

B Tree: tree with a fanout of k (lower depth -> lower length of chained lookup pattern), also self-balancing. Insert/Search/Delete all O(log N). Extension of binary tree, suitable for storage devices, because storage latency sucks and you want smaller lookup chains. Variants used in NTFS/APFS/btrfs/extr/Reiser4.

B+ Tree: popularly used in RDBMS. Only keys stored in intermediate nodes. All values stored with keys at the root level. Good for scans and range queries.

LSM Tree: used in KV stores/NoSQL stores. Does not maintain a full order (levels internally sorted, but overlap with each other). Batches/delays inserts (memtable to sstable). Order of magnitude better for write/update performance. Not great for scans or range queries.

  • Are LSM trees objectively better than B+ trees?
  • Candidate argument 1: more batched write pattern, lower WAF.
  • Candidate argument 2: better caching behavior. Since LSMT blocks are infrequently updated, they can be augmented with filters (bloom/cuckoo). More below.

Caching Behavior — LSM vs B+T

Assertion: a B+ Tree can only cache m ≤ k levels in memory, and needs to do k-m I/Os for a disk lookup.

An LSM Tree, because of its relative persistence of SSTables, is more amenable to bloom filters. Not sure what SOTA for LSMT index caching is.

SILT lookups require 1.01 I/Os on an average — SOTA for DRAM + SSD KV stores. SILT is not LSM Tree-based, though.

MICA: in-memory multicore KV store. Each core owns a partition of the data, NIC routes shard requests to appropriate core. (NIC routing is only useful if req latency is to be super low, hence this is predominantly for in-memory systems).

Residual questions: what’s the state of the art for LSMT caching?

Also read: “Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs”

Guards etc — make LSM trees closer to Be-Trees (PebbleDB, Salesforce?)