Simplify, permit user control over affinity
This ensures we 'do the right thing' when the user imposes affinity through e.g. Intel MPI's I_MPI_PIN_DOMAIN or other mechanisms. It also no longer works with a shuffled cpuid array, which will hopefully have less surprising failure modes, and means we don't have to handle hyperthreading explicitly. If we want to be clever in this way, e.g. to maximise available cache, then we could replace libnuma with hwloc as discussed elsewhere.
Hopefully this gives us a reasonable affinity which is easily overridden using standard techniques.
Merge request reports
Activity
mentioned in issue #78 (closed)
Thanks @alepper. Can you guide me through the changes and their rationale?
No problem.
We should adhere to any affinity which is applied before we begin. Principle of least surprise, and also allows a user to exert control a) if they know better for some particular machine or b) if they want to experiment with the effect of affinity itself. Within those constraints, we want to do a reasonable job allocating runners to logical processors. At the moment, this is particularly with hardware multithreads and NUMA in mind. Without these changes, we ignore any affinity set by the user or the MPI implementation - we run with their privileges, so we can override anything they've set.
To fix this, this change limits the list of available processors for runners (the `cpuid` array) to only those which were in our affinity mask when we first ran. This is the loop at engine.c lines 2024 to 2030. However, there's a catch: we 'first touch' most of our memory (and hence allocate it local to the NUMA node) before `engine_init` is called. We would prefer to allocate runners on this same NUMA node [1]. We pin the main thread at entry (before we touch much memory) so that we can figure out a 'home' (engine.c line 2038) from which we can sort by NUMA distance. Pinning the main thread discards all but one bit of the original affinity mask, so the new `engine_pin` stores this for later use via the `cpuid` array as discussed above. `engine_entry_affinity` ensures the same `engine_init` code works whether main.c calls `engine_pin` or not. I remembered a suggestion that a pinned main thread sometimes caused performance problems, so the main thread's affinity is reset (i.e. it is unpinned) at engine.c line 2067, once we're finished thinking about NUMA.

At the moment, we shuffle (engine.c lines 1990 to 1998) the list of processors we'll allocate to runners. This change removes that, so we begin work with the list of processors in order. This is partly because I'm not sure what the intent of the current code is, which makes it difficult to produce something which achieves the same aim, and partly because I think the shuffle makes our life harder. For example, this shuffle caused problems for you at first on SuperMUC. As I say, if this is trying to e.g. minimise sharing of lower-level cache, I think hwloc would be a better approach.
Without the shuffle, we also don't need to handle hyperthreads directly. Logical processors are already numbered in a sensible way, which previously had to be re-incorporated as part of the sort order; this change just preserves the original order.
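Going back to the entry-mask restriction above, here is a minimal sketch of the idea, not the engine.c implementation: record the affinity mask we were started with and filter the `cpuid` array against it. The helper names are illustrative only.

```c
/* Sketch only: record the entry affinity mask and restrict the cpuid
 * array to it. Not the actual engine.c code or names. */
#define _GNU_SOURCE
#include <sched.h>

static cpu_set_t entry_affinity;

/* Record the mask once, at entry, before anything re-pins the main thread. */
void record_entry_affinity(void) {
  sched_getaffinity(0, sizeof(entry_affinity), &entry_affinity);
}

/* Filter the cpuid array down to the cpus allowed at entry; returns the
 * new count. */
int restrict_cpuid_to_entry_affinity(int *cpuid, int nr_cpus) {
  int kept = 0;
  for (int i = 0; i < nr_cpus; i++)
    if (CPU_ISSET(cpuid[i], &entry_affinity)) cpuid[kept++] = cpuid[i];
  return kept;
}
```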
The change also has each MPI rank - not just rank 0 - print its affinity at entry, and the list of logical processors to which it will allocate runners. The first of these may be useful enough to keep around (as a user, it's nice to check that your controls are having an effect) but the second is just to catch any problems in the near future.
[1] It might be better to change the allocation and initialisation instead, but I think that is likely to be more intrusive. When you're filling the machine, there shouldn't be any difference.
What remains would be to either simplify further by removing the NUMA part, making this the user's problem, or to communicate amongst ranks so we also always do the right thing when the affinity they start with overlaps, e.g. when neither the user nor the MPI implementation changes it from the default.
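To give a feel for that second option, here is one possible shape for it, assuming MPI-3's MPI_Comm_split_type and a simple round-robin deal of the shared mask. This is only a sketch of the idea, not code from this branch; a real version would first check that the ranks' masks actually overlap, and the function name is illustrative only.

```c
/* Sketch: if the ranks on a node all start with the same (overlapping)
 * mask, deal its cpus out round-robin so each rank pins itself to a
 * disjoint share. Assumes MPI_Init has already been called. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>

void split_overlapping_affinity(void) {
  /* Group the ranks that share this node. */
  MPI_Comm node;
  int node_rank, node_size;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node);
  MPI_Comm_rank(node, &node_rank);
  MPI_Comm_size(node, &node_size);

  cpu_set_t mask, share;
  sched_getaffinity(0, sizeof(mask), &mask);
  CPU_ZERO(&share);

  /* Deal the cpus in our entry mask round-robin over the node-local ranks. */
  int k = 0;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, &mask)) continue;
    if (k % node_size == node_rank) CPU_SET(cpu, &share);
    k++;
  }
  if (CPU_COUNT(&share) > 0) sched_setaffinity(0, sizeof(share), &share);

  MPI_Comm_free(&node);
}
```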
Ok, so my tests seemed to have the correct behaviour. I'll hand over to @pdraper as he has more experience playing with this than me.
Reassigned to @pdraper
Added 296 commits:
- ab0224b1...cbefe349 - 295 commits from branch master
- 5db7cd91 - Merge branch 'master' into mpi_and_ht_affinity
Added 1 commit:
- b16424e9 - Post-merge fixes
Added 1 commit:
- b484ebee - Removed dependency on _GNU_SOURCE
Added 1 commit:
- 9a668d20 - Documentation of the extra functions.
Ok, I have merged master into this branch. I have also removed the need for _GNU_SOURCE; it was odd to use 'bool' types in that one and only function.
For me, this is ready to go. Over to you @pdraper.
Added 1 commit:
- c34f643c - Restore _GNU_SOURCE required by sched_getcpu()
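For context, sched_getcpu() is a GNU extension: glibc only declares it in <sched.h> when _GNU_SOURCE is defined before the first include, which is why the macro had to come back. A minimal illustration:

```c
/* sched_getcpu() is only declared by glibc when _GNU_SOURCE is defined
 * before <sched.h> is included. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  printf("main thread currently on cpu %d\n", sched_getcpu());
  return 0;
}
```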
Looking at the result of these changes on COSMA I'm not sure they are what I would expect, namely that the work is spread equally across all CPUs. What seems to happen is that all the runners end up on the same CPU.
To make this concrete: submitting a 12 core job to COSMA4, we see runners on cores 0-7 and 12-17, which, as the output from cpuinfo reveals:

===== Placement on packages =====
Package Id.  Core Id.        Processors
0            0,1,2,8,9,10    (0,12)(1,13)(2,14)(3,15)(4,16)(5,17)
1            0,1,2,8,9,10    (6,18)(7,19)(8,20)(9,21)(10,22)(11,23)

are all on CPU 0, while CPU 1 is idle. Previously we ran on cores 0-11 (and yes, it ran more quickly).
Using default settings for Intel MPI here.
Added 199 commits:
- c34f643c...e5a9a253 - 197 commits from branch master
- 5cc8a4d0 - Merge branch 'master' into mpi_and_ht_affinity
- 74fe8fc8 - Output cpuid assignments when verbose
I've changed this to a WIP as I've created a fork of this in the branch `affinity-fixes`. The main change is to sort the cpuids into NUMA order, so the map picks a core from numa1, then one from numa2, etc., until all the available cores are used. That makes sure you pick hyperthreads last, and it seems to work well with Intel MPI when you have multiple ranks per node.
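Roughly, that sorting looks like the following sketch using libnuma's numa_node_of_cpu(); it is not the code in the `affinity-fixes` branch, just the round-robin-over-nodes idea. Because logical cpus are normally numbered with all physical cores before their hyperthread siblings, keeping ascending order within each node pushes the siblings to the end of the list.

```c
/* Sketch of sorting cpuids into NUMA order: take one core from each node
 * in turn until the list is exhausted. Link with -lnuma. */
#include <numa.h>
#include <stdlib.h>
#include <string.h>

static void interleave_by_numa(int *cpuid, int nr_cpus) {
  if (numa_available() < 0) return; /* no NUMA information: keep the order */

  const int maxnode = numa_max_node();
  int *sorted = malloc(nr_cpus * sizeof(int));
  int *used = calloc(nr_cpus, sizeof(int));
  int k = 0;

  /* Sweep the nodes repeatedly, taking the next unused cpu from each. */
  while (k < nr_cpus) {
    int found = 0;
    for (int node = 0; node <= maxnode && k < nr_cpus; node++) {
      for (int i = 0; i < nr_cpus; i++) {
        if (!used[i] && numa_node_of_cpu(cpuid[i]) == node) {
          used[i] = 1;
          sorted[k++] = cpuid[i];
          found = 1;
          break;
        }
      }
    }
    if (!found) break; /* nothing placeable left in a full sweep */
  }
  /* Append any cpus whose node could not be determined. */
  for (int i = 0; i < nr_cpus; i++)
    if (!used[i]) sorted[k++] = cpuid[i];

  memcpy(cpuid, sorted, nr_cpus * sizeof(int));
  free(sorted);
  free(used);
}
```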
PLATFORM MPI is still not working and I suspect it will remain so, so I've also added a command-line option to disable affinity, regardless of the engine policy.
Could we try this out on non-COSMA machines?
I tried an AMD 48 core machine and that worked, so James will look at (secret unknown machine with many cores).
@jwillis could you add your findings here once run?
The changes that @pdraper made in the affinity_fixes branch seem to have worked: the code scales better with affinity enabled, compared with my old run with master that had affinity turned off.
Added 27 commits:
- 74fe8fc8...145f97b5 - 25 commits from branch master
- e64712c9 - Change affinity so that cores from different numa nodes are selected
- 3d5f0127 - Merge branch 'master' into mpi_and_ht_affinity
Changing to @matthieu since this is now a major modification by me.
Reassigned to @matthieu
Added 1 commit:
- e48bdf1c - Disable processor affinity by default, this is less surprising more often
mentioned in commit 51cb767b