export OMP_NUM_THREADS=64
|
|
./swift --cosmology --eagle --threads=64 --pin eagle_12.yml -v 1
|
|
|
```
|
|
|
|
|
|
# Pinning and MPI
|
|
|
|
|
|
By default, `mpirun` on HAWK tries to pin threads to cores in a way that interferes with SWIFT's own pinning mechanism. Because of this, running SWIFT over MPI with `--pin` and default settings is a very bad idea.
|
|
|
|
|
|
If you run one MPI rank per node, it suffices to disable the default pinning behaviour by setting the environment variable `MPI_DSM_DISTRIBUTE` to `False` in the job script:
|
|
|
|
|
|
```
|
|
|
export MPI_DSM_DISTRIBUTE=False
|
|
|
```
|
|
|
|
|
|
This will launch the master thread for each MPI rank on an arbitrary core on the node and give that thread full access to all other cores so that SWIFT can do its own pinning. With the default value `MPI_DSM_DISTRIBUTE=True`, each master thread would only see one core, and SWIFT would be forced to launch all its threads on that single core, effectively running with worse performance than if it were running in serial.
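The relevant job-script lines for this one-rank-per-node setup might look as follows. This is a sketch only: the node count (2), thread count (128), and binary name are assumptions, so adapt them to your job.

```
# Sketch of a one-rank-per-node launch (hypothetical values; adapt to your job).
export MPI_DSM_DISTRIBUTE=False   # let SWIFT do its own pinning

# With a single rank per node, --pin is safe again:
# mpirun -np 2 ./swift_mpi --cosmology --eagle --threads=128 --pin eagle_12.yml
echo "MPI_DSM_DISTRIBUTE=$MPI_DSM_DISTRIBUTE"
```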
|
|
|
|
|
|
When running multiple ranks on a node, setting `MPI_DSM_DISTRIBUTE=False` only works if you run SWIFT without `--pin`, which also degrades performance. A more efficient alternative is to keep `MPI_DSM_DISTRIBUTE=True`, run without `--pin`, but add additional directives to the `mpirun` command so that threads are appropriately pinned. To run with 2 ranks, each using 64 cores, the appropriate command is:
|
|
|
|
|
|
```
|
|
|
mpirun -np 2 omplace -nt 64 -tm pthreads \
|
|
|
./swift_mpi --threads=64 ...
|
|
|
```
|
|
|
|
|
|
The `omplace` command pins the threads in blocks of 64 cores per rank: the first block is assigned to rank 0 and the second to rank 1, so the task threads end up on the first 128 physical cores. The threadpool by default also uses 64 threads per rank, which are pinned to the next block of 64 cores for each rank, i.e. the 128 hyper-threaded cores (effectively reusing the same physical cores as the task threads). `-tm pthreads` tells `omplace` that SWIFT will spawn these threads using `pthreads`.
|
|
|
|
|
|
`omplace` allows for more complex pinning instructions than covered here. However, when using those, keep in mind that SWIFT uses more threads than the value given in the `--threads` argument, because the threadpool spawns additional threads. For optimal performance, you want to make sure that those additional threads are pinned to hyper-cores, so that they reuse the same physical cores as the task threads. You should avoid pinning task threads and threadpool threads to a mixture of cores and hyper-cores at all costs, because then several threads will simultaneously compete for the same physical cores. In the default `omplace` CPU layout, the first 128 cores are physical cores, and the next 128 are the hyper-cores. The cores themselves are spread over 2 physical sockets, with the first 64 bound to the first socket and so on. Also make sure to set `-tm pthreads`, since the default pinning model used by `omplace` assumes the use of OpenMP, which is not appropriate for SWIFT.
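To verify where threads actually ended up, you can inspect the `PSR` column of `ps` on a compute node while the run is in progress. A minimal sketch (here the current shell `$$` stands in for the SWIFT process; substitute the PID of your running `swift_mpi`, e.g. via `pgrep`):

```
# Show, for every thread (TID) of a process, the processor it last
# ran on (PSR). Inspecting the current shell as a stand-in; replace
# $$ with the PID of the running swift_mpi process.
ps -L -o tid,psr,comm -p $$
```

If the task threads and threadpool threads are pinned as intended, each PSR pair (core and its hyper-core sibling) should appear for exactly one rank.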
|
|
|
|
|
|
# Python Tooling
|
|
|
|
|
|
As HAWK has no connection to the outside world, you need to use a tunnel to transfer pip data. HAWK also does not seem to correctly support Python virtual environments, so we will install everything into the user site with `pip install --user`.
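A sketch of the workflow follows. The proxy port (3128) and the login host name are assumptions; adapt them to your local setup.

```
# On your workstation: open a reverse tunnel so HAWK can reach a proxy
# running locally (port and host name are assumptions):
#   ssh -R 3128:localhost:3128 <user>@hawk.hww.hlrs.de

# On HAWK: route pip through the tunnel and install into the user site:
#   export https_proxy=http://localhost:3128
#   pip install --user numpy

# User installs land under the Python user base; check where that is:
python3 -m site --user-base
```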
|