Modern storage stacks — 2–7 GB/s throughput, with 4–5 µs latencies
Half the latency comes from the software stack — bad. The papers criticize SPDK/kernel bypass as having several problems (busy polling, wasted CPU cycles, loss of kernel services like isolation and scheduling). Instead, use eBPF to inject user-defined functions into kernelspace.
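The injected-function idea can be sketched as a toy model (names and interface are my invention, not XRP's actual API): the "kernel" calls a user-supplied function on each completed block, and that function either returns the next block to read (resubmitted in-kernel, no crossing back to userspace) or signals completion.

```python
# Toy model of an XRP-style resubmission hook (made-up interface, not the
# real API): a pointer chase over fake 512 B blocks where the first 8 bytes
# of each block name the next block (0 = end of chain).

BLOCK_SIZE = 512

def make_disk(chain):
    """Build fake on-disk blocks; chain maps block number -> next block."""
    disk = {}
    for blk, nxt in chain.items():
        disk[blk] = nxt.to_bytes(8, "little") + b"\x00" * (BLOCK_SIZE - 8)
    return disk

def bpf_lookup(data):
    """The injected function: inspect a completed block and return the next
    block number to fetch, or None to complete the I/O back to userspace."""
    nxt = int.from_bytes(data[:8], "little")
    return nxt if nxt != 0 else None

def kernel_read(disk, start, fn):
    """Resubmission loop: one syscall in, N device reads, one return out."""
    blk = start
    while True:
        data = disk[blk]
        nxt = fn(data)
        if nxt is None:
            return data
        blk = nxt  # resubmit in-kernel; no userspace round trip
```

A three-block chase costs one kernel crossing instead of three round trips through `read()`.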
Breakdown of a 6.27 µs read() syscall, 512 B, Intel Optane:
- kernel crossing: 351 ns
- read syscall: 199 ns
- ext4: 2006 ns
- bio: 379 ns
- NVMe driver: 113 ns
- storage device: 3224 ns
(the filesystem, block I/O, and NVMe layers each appear to use paired submission and completion queues)
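The breakdown above checks out with quick arithmetic — everything except the device itself is software overhead, and it comes to almost exactly half the total:

```python
# Per-layer latencies (ns) from the 512 B read() breakdown above.
layers = {
    "kernel crossing": 351,
    "read syscall": 199,
    "ext4": 2006,
    "bio": 379,
    "NVMe driver": 113,
    "storage device": 3224,
}

total = sum(layers.values())                 # 6272 ns ~= 6.27 us
software = total - layers["storage device"]  # 3048 ns of pure software
print(total, software, round(software / total, 3))  # → 6272 3048 0.486
```

So "half the latency is software" is 48.6% on this hardware — and the fraction only grows as devices get faster.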
SR-IOV — a PCIe standard capability (not NVMe-specific) that partitions a physical device into multiple virtual functions, each of which can appear as its own NVMe controller with namespaces. Claimed problem with userspace/kernel bypass: sharing files or capacity between distrusting processes is hard (SR-IOV supports only a limited number of virtual functions).
io_uring — batched I/O submission/completion that amortizes the cost of kernel boundary crossings, but all the layers (ext4/bio/NVMe) still run on every request.
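io_uring proper needs liburing bindings, but the amortization idea has a rough stdlib analogy: a vectored read fills several buffers with a single syscall instead of paying one kernel crossing per `read()` (Linux-only sketch; `os.preadv` is a real call, the file layout is made up):

```python
import os
import tempfile

# Rough analogy only: io_uring batches many submissions per crossing;
# os.preadv at least fills four buffers in ONE syscall instead of four.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(16)) * 64)  # 1024 bytes of test data
    path = f.name

fd = os.open(path, os.O_RDONLY)
bufs = [bytearray(256) for _ in range(4)]  # four 256 B destination buffers
n = os.preadv(fd, bufs, 0)                 # one kernel crossing
os.close(fd)
os.unlink(path)
print(n)  # → 1024
```

The point of the criticism stands either way: fewer crossings, but each request still traverses the full filesystem/bio/NVMe path.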
BPF for storage — similar argument to RPC vs. raw RDMA: an RPC handler can perform multiple dependent lookups server-side before returning to the client, just as a BPF function can resubmit dependent I/Os in-kernel without returning to userspace.
— — — — — — — — — — —
Core part of their design is bypassing the filesystem logic. They observe that the on-disk files of most storage data structures (B+-trees, LSM-tree SSTables) are immutable or rarely change their extents once written, and this relative stability can be exploited to bypass the filesystem on the read path.
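A hypothetical sketch of that stability argument (class and method names are mine, not XRP's): cache a file's logical→physical extent map once, translate offsets from the cache on every read, and fall back to the full filesystem path only on a miss.

```python
# Toy extent-map cache exploiting SSTable immutability: resolve the
# logical->physical mapping once, then skip filesystem logic on reads.

class ExtentCache:
    def __init__(self):
        # file -> list of (logical_start, length, physical_start), in bytes
        self.extents = {}

    def install(self, name, extents):
        """Resolve a file's extents once (e.g. at open) and cache them."""
        self.extents[name] = extents

    def translate(self, name, offset):
        """Map a logical file offset to a device offset, or None on a
        miss (caller falls back to the full filesystem path)."""
        for lstart, length, pstart in self.extents.get(name, ()):
            if lstart <= offset < lstart + length:
                return pstart + (offset - lstart)
        return None

cache = ExtentCache()
cache.install("sst-001", [(0, 4096, 81920), (4096, 4096, 204800)])
print(cache.translate("sst-001", 5000))  # → 205704
```

Immutability is what makes the cache safe: if an extent never moves, the cached translation never goes stale.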
— — — — — — — — — — —
questions to address:
- why/how is BPF faster than SPDK?
- SPDK should offer the best perf at the cost of polling etc.
- BPF should have some doorbell penalty on completion
BPF for Storage: An Exokernel-Inspired Approach, Zhong et al., HotOS '21
XRP: In-Kernel Storage Functions with eBPF, Zhong et al., OSDI '22