export OMP_NUM_THREADS=64
./swift --cosmology --eagle --threads=64 --pin eagle_12.yml -v 1
```

# Pinning and MPI

At the time of writing (18/01/2022) there is no stable way to run SWIFT over MPI on multiple nodes using the modules available on HAWK. The default MPT library on the system suffers from occasional deadlocks, even when run on a single node. OpenMPI does better, but still has issues when run over the interconnect. These issues can be minimised by running with 1 rank per NUMA domain (8 ranks per node). The most stable configuration at the time of writing is a manually installed OpenMPI 4.1.2.

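To see the NUMA layout of a compute node (and hence the natural number of ranks per node), a tool like `numactl` can be used. This is a hedged aside, assuming `numactl` is available on the compute nodes:

```
# Print the NUMA topology of the current node; a HAWK Rome node
# (128 cores) should report 8 NUMA domains of 16 cores each.
numactl --hardware
```
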
## OpenMPI

When using OpenMPI, you need an additional argument to `mpirun` to ensure a proper mapping of ranks to processors: `--map-by`. Recommended values are listed below (an example invocation follows the list):
- for 1 rank/node: `--map-by ppr:1:node:pe=128`: places one rank on each node and assigns all 128 cores of the node as threads to that rank
- for 2 ranks/node: `--map-by ppr:1:socket:pe=64`: assigns a rank to each socket (physical CPU) and pins the rank's 64 threads to the 64 cores of that CPU
- for 4 ranks/node: `--map-by ppr:2:socket:pe=32`
- for 8 ranks/node: `--map-by ppr:4:socket:pe=16`; the same can be achieved using `--map-by ppr:1:numa:pe=16`

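For example, a sketch of a 2 ranks/node run on 4 nodes (8 ranks in total), assuming the same `swift_mpi` binary and `eagle_12.yml` input as in the job scripts below:

```
mpirun -np 8 --map-by ppr:1:socket:pe=64 \
    ./swift_mpi --cosmology --eagle --threads=64 --pin eagle_12.yml -v 1
```
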
There is a significant speed gain when using 4 or 8 ranks/node compared to 1 or 2 ranks/node.

### Single node

The following job script works for a single-node MPI job using OpenMPI 4.0.4:

```
#!/bin/bash
#PBS -N SWIFT-EAGLE-12
#PBS -l select=1:node_type=rome:mpiprocs=8:ompthreads=16
#PBS -l walltime=24:00:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Unload the defaults
module purge

# Load the modules we need
module load intel/19.1.0
module load openmpi/4.0.4
module load parmetis/4.0.3-int64
module load fftw/3.3.8
module load hdf5/1.10.5
module load tbb/19.1.0
module load gsl

# use 8 ranks/node
nrank=8
mapby="ppr:4:socket:pe=16"

# Run SWIFT
mpirun -np $nrank --map-by $mapby \
    ./swift_mpi --cosmology --eagle --threads=16 --pin eagle_12.yml -v 1
```

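Assuming the script is saved as `eagle_12.pbs` (name hypothetical), it is submitted in the usual PBS way:

```
qsub eagle_12.pbs
```
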
### Multinode

On multiple nodes, SWIFT occasionally deadlocks because of a known thread-safety issue in OpenMPI 4.0.4. This can be avoided by disabling some of the transports in the underlying UCX communication layer, as illustrated in the following script that runs on 8 nodes:

```
#!/bin/bash
#PBS -N SWIFT-EAGLE-12
#PBS -l select=8:node_type=rome:mpiprocs=8:ompthreads=16
#PBS -l walltime=24:00:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Unload the defaults
module purge

# Load the modules we need
module load intel/19.1.0
module load openmpi/4.0.4
module load parmetis/4.0.3-int64
module load fftw/3.3.8
module load hdf5/1.10.5
module load tbb/19.1.0
module load gsl

# use 8 ranks/node, i.e. 8 nodes x 8 ranks = 64 ranks in total
nrank=64
mapby="ppr:4:socket:pe=16"

# disable problematic inter-node transports
export UCX_TLS=self,sm,ud

# Run SWIFT
mpirun -np $nrank --map-by $mapby ./swift_mpi --cosmology --eagle --threads=16 --pin eagle_12.yml -v 1
```

By limiting the number of transports, the code can be up to a factor of 4 slower than it should be. When using relatively few ranks, additionally setting the following environment variables counteracts this efficiency loss:

```
export UCX_MM_RX_MAX_BUFS=65536
export UCX_IB_RX_MAX_BUFS=65536
export UCX_IB_TX_MAX_BUFS=65536
export UCX_UD_MLX5_RX_QUEUE_LEN=16384
```

However, using the same environment variables for larger runs causes errors.

## OpenMPI 4.1.2

The bug that prevents optimal multinode performance in the default OpenMPI 4.0.4 was fixed in OpenMPI 4.1.1. Using OpenMPI 4.1.2 (the latest stable release at the time of writing, 18/01/2022) restores full multinode performance without the need for additional environment variables. However, this version needs to be installed manually, and it has other issues related to the shared-memory transport that require manually disabling `xpmem`:

```
export UCX_SHM_DEVICES=sysv
```

There is an additional issue with parallel HDF5 when reading symbolic links, which requires disabling the ompio I/O layer of OpenMPI:

```
mpirun --mca io romio321 ...
```

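Putting these workarounds together, a sketch of a full 8-node launch with the manually installed OpenMPI 4.1.2, reusing the `nrank` and `--map-by` values from the multinode script above:

```
export UCX_SHM_DEVICES=sysv
mpirun -np 64 --map-by ppr:4:socket:pe=16 --mca io romio321 \
    ./swift_mpi --cosmology --eagle --threads=16 --pin eagle_12.yml -v 1
```
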
Finally, you of course also need to manually install versions of parallel HDF5, FFTW and ParMETIS that are compatible with OpenMPI 4.1.2.

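As an illustration, a hedged sketch of building a compatible parallel HDF5 from source; the install prefix and the location of the manually installed OpenMPI are hypothetical:

```
# make sure mpicc from the manually installed OpenMPI 4.1.2 is picked up
export PATH=$HOME/openmpi-4.1.2/bin:$PATH

# standard autotools build of parallel HDF5
CC=mpicc ./configure --enable-parallel --prefix=$HOME/local/hdf5-ompi412
make -j 8 && make install
```
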
OpenMPI 4.1.2 seems to run quite stably with 8 ranks/node, but crashes for lower numbers of ranks per node. It also occasionally deadlocks for single-node runs.

## MPT: Pinning

The `mpirun` on HAWK automatically tries to pin threads to resources in a way that interferes with SWIFT's own pinning mechanism. Because of this, it is a very bad idea to run SWIFT with `--pin` and the default settings over MPI.

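In practice, a hedged sketch of a safer MPT launch is simply to drop SWIFT's `--pin` flag and let the MPI runtime handle placement (same binary and input as above; the exact `mpirun` syntax depends on the MPT version):

```
mpirun -np 64 ./swift_mpi --cosmology --eagle --threads=16 eagle_12.yml -v 1
```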