MPI Collectives: Notes, benchmarks etc

Ankush Jain
Jul 11, 2022


  1. MPI_Gather: all-to-one collection of data at a root rank (commonly rank 0)
  2. MPI_Gatherv: like MPI_Gather, but the amount of data sent by each rank can vary (the root supplies recvcounts and displs arrays describing where each rank's data lands)
  3. MPI_Allgather: like MPI_Gather + broadcasting the collected data, so every rank ends up with the full result
  4. MPI_Iallgather: like MPI_Allgather, but non-blocking. The returned request object needs to be MPI_Wait'ed upon before the collective is considered complete
  5. MPI_Iallgatherv: non-blocking and variable-length (a minimal usage sketch follows this list)
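
For reference, a minimal sketch of the non-blocking, variable-length case (MPI_Iallgatherv). The buffer sizes and the "rank i sends i+1 ints" pattern are purely illustrative, and error checking is omitted:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative: rank i contributes i+1 ints. */
    int sendcount = rank + 1;
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

    /* For the all-gather variant, every rank supplies recvcounts[] and
     * displs[] describing where each rank's data lands in recvbuf. */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) {
        recvcounts[i] = i + 1;
        displs[i] = total;
        total += recvcounts[i];
    }
    int *recvbuf = malloc(total * sizeof(int));

    /* Non-blocking: returns immediately with a request object. */
    MPI_Request req;
    MPI_Iallgatherv(sendbuf, sendcount, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT,
                    MPI_COMM_WORLD, &req);

    /* ... useful compute could go here ... */

    /* Only after MPI_Wait returns is the collective complete. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}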

OSU benchmarks

Columns for MPI_Iallgatherv (osu_iallgatherv): Overall, Compute, Init, Test, Wait, Pure Comm, Min Comm, Max Comm, Overlap

(Test_Time = 0 here)

Overall = Init_Time + Compute_Time + Wait_Time

Compute/Init/Test/Wait = What they say

Pure/Min/Max Comm = time for invocations of the collective on their own, with no compute kernel in between

Overlap = a bit weird (it shows 0% for the default run described below); computed as:

X = (overall_time - cpu_time) / avg_comm_time * 100

overlap = MAX(0, 100 - X)

If overall_time >> cpu_time (the communication isn't hidden at all), X approaches (or exceeds) 100 and overlap goes to 0.

If overall_time ~= cpu_time (the communication is fully hidden behind the compute), X approaches 0 and overlap approaches 100.
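
Written out as a tiny helper (my variable names, mirroring the formula above rather than OSU's exact source), and checked against the first row of the osu_iallgather run below (Overall = 64.55, Compute = 50.97, Pure Comm = 33.67, Overlap = 59.67):

#include <stdio.h>

/* Overlap: how much of the pure communication time was hidden behind
 * the compute kernel. Mirrors overlap = MAX(0, 100 - X) above. */
static double overlap_pct(double overall_us, double compute_us, double pure_comm_us) {
    double x = (overall_us - compute_us) / pure_comm_us * 100.0;
    return x > 100.0 ? 0.0 : 100.0 - x;
}

int main(void) {
    /* First row of the osu_iallgather table below. */
    printf("overlap = %.2f%%\n", overlap_pct(64.55, 50.97, 33.67));  /* prints ~59.67% */
    return 0;
}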

Overlapping transfer and computation

mpirun -f hosts.txt -n 512 /users/ankushj/repos/intel-mpi-benchmarks/osump-mpich-default-ubuntu2004/build-mpich-default-ubuntu-2004/mpi/collective/osu_iallgather -m 2:256 -i 1000 -x 200 -f

This command reports 0% overlap. Even though the compute kernel provides plenty of cover for the communication to progress in the background, the communication takes just as long with the compute as without it.

But when -t [1–100] is added, the compute kernel is interspersed with MPI_Test calls. In most MPI implementations, each MPI_Test call also performs a small amount of progress on outstanding requests (the calls themselves are non-blocking), so the more MPI_Test calls are made before the final MPI_Wait, the more overlap is achievable, with diminishing returns.

MPI_Wait is blocking, so it completes whatever progress is still outstanding.

(MPI_Test calls also carry a fixed overhead of roughly 1 us per call, so the appropriate number of calls is a function of the amount of data being transferred.)
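
In code, the pattern the -t flag drives looks roughly like this (the chunked compute loop below is my own illustrative stand-in, not the benchmark's actual kernel):

#include <mpi.h>
#include <stdlib.h>

/* Stand-in for one slice of the compute kernel. */
static void compute_chunk(void) {
    volatile double x = 0.0;
    for (int i = 0; i < 100000; i++) x += i * 0.5;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;
    int *recvbuf = malloc(size * sizeof(int));

    /* Start the non-blocking collective. */
    MPI_Request req;
    MPI_Iallgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD, &req);

    /* Split the compute into chunks; each MPI_Test in between nudges the
     * progress engine, so communication advances while we compute. */
    int done = 0;
    const int num_test_calls = 25;  /* analogous to -t 25 */
    for (int i = 0; i < num_test_calls; i++) {
        compute_chunk();
        if (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    /* MPI_Wait is blocking; it finishes whatever progress is left. */
    if (!done) MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}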

(Also, with PSM, setting IPATH_NO_CPUAFFINITY seems to make no difference to these numbers)

Typical Latencies

(base) ankushj@h0 ~/r/a/scripts ❯❯❯ run512 $BENCH/osu_iallgather -m 1:512 -i 100 -x 20 -f -t 25

# OSU MPI Non-blocking Allgather Latency Test v5.9
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Min Comm.(us) Max Comm.(us) Overlap(%)
1 64.55 50.97 1.58 9.17 2.67 33.67 31.65 33.89 59.67
2 64.81 50.38 1.55 9.25 3.46 28.52 25.46 30.37 49.41
4 67.71 50.93 1.64 11.01 3.98 37.08 31.95 38.47 54.75
8 96.58 74.25 1.68 15.18 5.32 52.07 47.72 54.50 57.11
16 144.70 103.28 1.81 22.04 17.42 92.07 85.80 93.57 55.01
32 231.64 168.59 2.32 28.53 32.04 148.90 135.29 153.00 57.65
64 384.24 278.15 2.65 33.08 70.19 255.50 223.17 266.76 58.48
128 690.38 494.93 2.96 40.90 151.42 465.71 401.01 498.71 58.03
256 1290.56 880.76 3.32 54.11 352.20 839.51 763.60 880.80 51.18
512 2679.94 1694.08 3.54 133.39 848.75 1627.27 1596.83 1693.82 39.42

(base) ankushj@h0 ~/r/a/scripts ❯❯❯ run512 $BENCH/osu_gather -m 1:512 -i 1000 -x 200 -f

# OSU MPI Gather Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 0.97 0.32 17.95 1000
2 1.08 0.42 18.93 1000
4 1.09 0.42 19.49 1000
8 1.15 0.42 21.60 1000
16 1.25 0.45 24.03 1000
32 1.34 0.46 29.23 1000
64 1.57 0.52 39.17 1000
128 1.78 0.51 57.18 1000
256 2.25 0.52 100.85 1000
512 3.24 0.54 175.81 1000

(base) ankushj@h0 ~/r/a/scripts ❯❯❯ run512 $BENCH/osu_igather -m 1:512 -i 1000 -x 200 -f

# OSU MPI Non-blocking Gather Latency Test v5.9
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size Overall(us) Compute(us) Coll. Init(us) MPI_Test(us) MPI_Wait(us) Pure Comm.(us) Min Comm.(us) Max Comm.(us) Overlap(%)
1 3.65 2.05 0.55 0.00 0.90 1.59 0.85 19.79 0.00
2 3.68 2.06 0.55 0.00 0.91 1.59 0.87 20.14 0.00
4 3.70 2.07 0.56 0.00 0.91 1.62 0.90 21.40 0.00
8 3.84 2.16 0.57 0.00 0.96 1.69 0.84 23.03 0.73
16 4.12 2.36 0.59 0.00 1.02 1.79 0.94 23.61 1.25
32 4.35 2.49 0.60 0.00 1.11 1.93 0.98 29.55 3.50
64 5.06 2.94 0.65 0.00 1.32 2.16 1.05 38.81 1.70
128 5.92 3.22 0.67 0.00 1.88 2.48 1.03 54.26 0.00
256 9.72 4.57 0.74 0.00 4.25 3.76 1.25 138.81 0.00
512 12.18 5.02 0.81 0.00 6.18 4.26 1.11 179.72 0.00

(base) ankushj@h0 ~/r/a/scripts ❯❯❯ run512 $BENCH/osu_bcast -m 1:512 -i 1000 -x 200 -f

# OSU MPI Broadcast Latency Test v5.9
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 8.14 3.40 13.52 1000
2 8.06 3.34 13.45 1000
4 7.90 3.34 13.28 1000
8 8.14 3.37 13.33 1000
16 9.18 3.59 15.36 1000
32 8.96 3.51 15.38 1000
64 9.09 3.53 15.14 1000
128 9.23 3.61 15.54 1000
256 9.76 3.80 16.62 1000
512 10.72 3.92 17.93 1000
