On doorbells/NVMe etc.

Ankush Jain
Aug 24, 2022


  • libfabric queue pairs, and ops on them, are thread safe: one queue pair can be shared between threads (though at a performance cost).
  • NVMe queue pairs are meant to be assigned one per thread; they are not thread safe (and user- or kernel-level locks would be too slow). Wonder why.
  • NVMe work posting also forces a poll-vs-interrupt decision. Then why is BPF-based I/O more performant than SPDK?

SAS and SATA had a single queue (256 and 32 outstanding commands, respectively). This was fine because spinning disks had no inherent parallelism.

Things have changed with flash. SSD controllers are multicore, and parallelism is available at the die/plane/etc. levels. Millions of IOPS are possible.

The NVMe standard allows up to 64K queues and 64K commands per queue. Currently available devices and controllers usually support only a small fraction of that. More outstanding commands also give the out-of-order (OoO) machinery room to work: commands can be merged or reordered.

The NVMe perf talk [1] also recommends doorbell batching (+ something else, Trick 1) to minimize MMIOs. Naive: one MMIO to post to the SQ, one to acknowledge the CQ. Better: batch the CQ doorbell (write it every 512 commands; devices usually have 1024 queue slots). For more IOPS, batch the doorbell on the submission side as well.

Doorbell batching on the submission side: copy the command into the SQ, but don't ring the SQ doorbell yet. Ring it when the user polls (this is like how MPI makes network progress only when you call MPI_Probe or MPI_Test, methinks). Massive IOPS impact (roughly 4x).

[1] is interesting. Eliminating data-dependent loads and similar standard userspace IOPS tricks buy another ~5%: basically, cleverly prefetching the relevant structs in the polling loop.

Residual questions:

Doorbell batching (on both the submission and completion sides) is super important. Can BPF-based I/O batch doorbells too? If not, how can it be faster without doorbell batching?


[1] https://www.snia.org/sites/default/files/SDC/2019/presentations/NVMe/Walker_Benjamin_10%20Million_IOps_From_a_Single_Thread.pdf