On MMIO, DMA, and PCIe 3.0

Ankush Jain
2 min read · Jul 13, 2023

The goal of this post is to reconcile the following two papers and some slides from the Linux Plumbers’ Conference.

ATC16: https://www.usenix.org/system/files/conference/atc16/atc16_paper-kalia.pdf
SC18: https://dl.acm.org/doi/10.1145/3230543.3230560

The ATC16 paper seems to make some confusing claims in Section 2.1. The bumps for DMA in Fig 2 of ATC16 should not be at C_rc multiples, but at MPS (PCIe Max Payload Size) multiples. It is possible that C_rc has nothing to do with it.

Actually, DMA bandwidth varies depending on whether the data transfer requests are MRd (reads) or MWr (writes). In the MRd case, the bytes transferred incur overheads for both the MRd requests and the CplD completions. But PCIe is full duplex, so the bottlenecked direction (CplD, which carries the data) will still have bumps at MPS boundaries.
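Here is a rough back-of-the-envelope sketch of that model in Python (the ~24B per-TLP overhead and the MPS/MRRS defaults are my own assumptions for illustration, not numbers taken from either paper):

    from math import ceil

    TLP_OVERHEAD = 24  # assumed per-TLP overhead (header + framing + CRC); not a number from the papers

    def dma_write_wire_bytes(xfer, mps=256):
        # A DMA write of xfer bytes is split into ceil(xfer / MPS) MWr TLPs,
        # so bytes on the wire step up at MPS multiples.
        n_mwr = ceil(xfer / mps)
        return xfer + n_mwr * TLP_OVERHEAD

    def dma_read_wire_bytes(xfer, mps=256, mrrs=512):
        # A DMA read issues ceil(xfer / MRRS) MRd TLPs in one direction and gets
        # ceil(xfer / MPS) CplD TLPs (which carry the data) back in the other.
        n_mrd = ceil(xfer / mrrs)
        n_cpld = ceil(xfer / mps)
        return {"mrd_dir": n_mrd * TLP_OVERHEAD,
                "cpld_dir": xfer + n_cpld * TLP_OVERHEAD}

    for size in (128, 256, 257, 512, 513):
        print(size, dma_write_wire_bytes(size), dma_read_wire_bytes(size))

The CplD direction is the one carrying the data, and its overhead steps at MPS boundaries, which is where I would expect the bumps in Fig 2.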

MPS vs MRRS

The SC18 paper uses the correct terminology for these parameters (as used in the PCIe specification), but the distinction between MPS and MRRS (Max Read Request Size) was not entirely clear to me.

If MPS = MRRS = 128B, each DMA read request is served by a single completion. If MPS = 256B and MRRS = 512B, then a single read request asks for 512B of data, which is still served by 2 CplD responses. This probably saves one MRd header’s worth of traffic: 24B / (2 × 24B + 512B), or ~4%.
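As a sanity check on that arithmetic (same assumed 24B per-TLP overhead; the CplD header overhead is identical in both configurations and is left out, matching the fraction above):

    xfer = 512  # bytes of DMA read data

    # MPS = MRRS = 256B: two MRd requests, each answered by one 256B CplD.
    base = 2 * 24 + xfer

    # MPS = 256B, MRRS = 512B: one MRd request, still answered by two 256B CplDs.
    merged = 1 * 24 + xfer

    print((base - merged) / base)  # ~0.043, i.e. the ~4% savings mentioned above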

More importantly, the LPC slides suggest that these parameters are tuneable to some extent, depending on hardware support.

MMIO

What about MMIO? SC18 does not discuss it, but ATC16 does.

It seems that the full story there is also more complicated, but maybe in a way that’s unnecessary to understand. Write combining (WC) applies to MMIO transfers (non-WC MMIO over PCIe would only be more expensive than the step function in ATC16 Fig 2 suggests). On Intel Core architectures, WC buffers live in the cache hierarchy and are called Line Fill Buffers, so it makes sense for them to be 64B, i.e. one cache line wide.
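Under that simplified step-function view (again with my assumed ~24B per-TLP overhead, and assuming each flushed 64B WC chunk becomes one MWr TLP, which, as noted below, is not the whole story):

    from math import ceil

    def mmio_wc_wire_bytes(xfer, wc_buf=64, tlp_overhead=24):
        # With write combining, an MMIO write of xfer bytes is flushed as roughly
        # ceil(xfer / 64) MWr TLPs, so the wire traffic steps at 64B boundaries,
        # matching the shape of the MMIO curve in ATC16 Fig 2.
        n_mwr = ceil(xfer / wc_buf)
        return xfer + n_mwr * tlp_overhead

    for size in (8, 64, 65, 128, 192):
        print(size, mmio_wc_wire_bytes(size))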

(Also, reading from WC memory is either not allowed or very expensive, as per https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/.)

Some further reading on write combining and where the WC buffers live:

http://blog.andy.glew.ca/2011/06/write-combining.html
https://stackoverflow.com/questions/49959963/where-is-the-write-combining-buffer-located-x86
https://github.com/awslabs/aws-fpga-app-notes/blob/master/Using-PCIe-Write-Combining/README.md

If you go through Andy Glew’s blog, it seems that the relation between MMIO size and bus occupancy is not as straightforward as the step function in ATC16 Fig 2 suggests. It’s probably still directionally correct, though; Fig 2 also sidesteps the complication by only counting the data transferred, without commenting on how that translates to time or cycles.

Conclusions

Idk what the conclusions are; systems is hard. If simplified models of all these intricacies allow us to understand PCIe insofar as it’s relevant for I/O devices, then by all means we should use them. I don’t think anything I noted above invalidates anything in those papers; it’s just a note that the underlying behaviors are more complex, even if the specifics may not matter for the applications and benchmarks discussed. (Also, there are a million references here, I only skimmed them and may have gotten plenty of things wrong. Caveat emptor.)
