notes on ebpf/kernel bypass/storage latency

Ankush Jain
Aug 24, 2022

--

Modern storage stacks: 2–7 GB/s throughput, with 4–5 µs latencies.

Roughly half of that latency comes from the software stack, which is bad. The papers criticize SPDK/kernel bypass as having a bunch of problems (busy polling, wasted CPU, etc.). Instead, they use eBPF to inject functions into kernelspace on the storage path.
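
For a sense of the mechanism, here is a minimal libbpf-style eBPF program that counts block-layer submissions by attaching a kprobe to submit_bio. This is ordinary tracing, not the storage-dispatch hook the papers add, and the choice of submit_bio as the attach point is my assumption (symbol availability varies by kernel version); it only illustrates how a small function gets injected into the kernel's I/O path.

```c
// Minimal sketch: count block-layer I/O submissions from inside the kernel.
// Attach point (submit_bio) is an assumption; compile with clang -target bpf.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} submit_count SEC(".maps");

SEC("kprobe/submit_bio")
int count_bio_submissions(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&submit_count, &key);

    if (val)
        __sync_fetch_and_add(val, 1);   /* one more bio entered the block layer */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```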

Breakdown of a 6.27 µs read() syscall (512 B, Intel Optane):

  • kernel crossing: 351 ns
  • read syscall: 199 ns
  • ext4: 2006 ns
  • bio: 379 ns
  • NVMe driver: 113 ns
  • storage device: 3224 ns

(The filesystem, block I/O, and NVMe layers each seem to use submission and completion queues.) Everything above the device sums to 351 + 199 + 2006 + 379 + 113 = 3048 ns, versus 3224 ns for the device itself, which is where the "half the latency is software" figure comes from.

SR-IOV: a PCIe capability that lets one physical device present itself as multiple virtual functions (NVMe separately allows partitioning a device into namespaces). The claimed problem with userspace/kernel bypass is sharing files or capacity between distrusting processes, since SR-IOV virtual functions and NVMe namespaces come in limited numbers.

io_uring: batched I/O submission to amortize the cost of kernel boundary crossings, but all the layers (ext4/bio/NVMe) still sit on the path of every I/O.
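
A minimal liburing sketch of the batching idea: queue one or more SQEs, then a single io_uring_submit() crosses into the kernel for all of them. The file path and queue depth are placeholders and error handling is trimmed; build with -luring.

```c
// Sketch: one batched read via io_uring (liburing). Placeholder file path.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

#define QUEUE_DEPTH 8
#define BLOCK_SIZE  512

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[BLOCK_SIZE];

    int fd = open("data.bin", O_RDONLY);        /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }

    io_uring_queue_init(QUEUE_DEPTH, &ring, 0); /* set up SQ/CQ rings once */

    /* prepare a 512 B read at offset 0; more SQEs could be queued here */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, 0);

    io_uring_submit(&ring);                     /* one syscall submits the batch */

    io_uring_wait_cqe(&ring, &cqe);             /* reap the completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```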

BPF for storage: similar arguments to RPC vs. RDMA (an RPC handler can perform multiple dependent lookups before returning to the client; likewise, an in-kernel BPF function can issue dependent I/Os without bouncing back to userspace each time).

— — — — — — — — — — —

Core part of their design is bypassing the filesystem logic. They observe that the relevant on-disk index structures (B+-tree index files, LSM-tree SSTables) are largely immutable once written, so their file extents rarely change; this relative stability can be exploited to reuse the file's logical-to-physical block mapping and skip the filesystem on the read path.
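
A rough sketch of what such an injected function could look like for B+-tree pointer chasing, written as plain C. The context struct, its fields, and the node layout here are hypothetical, not the actual XRP interface, and it glosses over BPF-verifier requirements like bounds checking; it only shows the resubmission idea: on each completed read, either chase the next on-disk pointer inside the kernel or mark the lookup done and return to userspace.

```c
/* Hypothetical sketch of an XRP-style resubmission function.
 * struct resubmit_ctx and struct btree_node are illustrative only,
 * NOT the real XRP API from the OSDI '22 paper. */
#include <linux/types.h>

#define NODE_FANOUT 64

struct btree_node {                 /* hypothetical on-disk B+-tree node */
    __u32 is_leaf;
    __u32 nkeys;
    __u64 keys[NODE_FANOUT];
    __u64 children[NODE_FANOUT];    /* disk offsets of child nodes */
    __u64 values[NODE_FANOUT];
};

struct resubmit_ctx {               /* hypothetical hook context */
    char  *data;                    /* buffer holding the block just read */
    __u64  next_addr;               /* where the kernel should resubmit */
    __u64  target_key;              /* lookup key stashed by userspace */
    __u64  result;
    int    done;                    /* 1 = return to user, 0 = resubmit */
};

/* Called on each I/O completion: walk one tree level, then either
 * resubmit the next read or finish the lookup. */
int btree_lookup_step(struct resubmit_ctx *ctx)
{
    struct btree_node *node = (struct btree_node *)ctx->data;
    __u32 i;

    for (i = 0; i < NODE_FANOUT && i < node->nkeys; i++) {
        if (ctx->target_key <= node->keys[i])
            break;
    }

    if (node->is_leaf) {
        ctx->result = node->values[i];
        ctx->done = 1;              /* completion surfaces to userspace */
    } else {
        ctx->next_addr = node->children[i];
        ctx->done = 0;              /* kernel resubmits: no extra syscall */
    }
    return 0;
}
```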

— — — — — — — — — — —

questions to address:

  • why/how is BPF faster than SPDK?
  • SPDK should offer the best perf at the cost of polling etc.
  • BPF should have some doorbell penalty on completion

References:

BPF for Storage: An Exokernel-Inspired Approach, Zhong et al., HotOS '21

XRP: In-Kernel Storage Functions with eBPF, Zhong et al., OSDI '22
