Simplify, permit user control over affinity
This ensures we 'do the right thing' when the user imposes affinity through e.g. Intel MPI's I_MPI_PIN_DOMAIN or other mechanisms. It also no longer works with a shuffled cpuid array, which will hopefully have less surprising failure modes, and means we don't have to handle hyperthreading explicitly. If we want to be clever in this way, e.g. to maximise available cache, then we could replace libnuma with hwloc as discussed elsewhere.
Hopefully this gives us a reasonable affinity which is easily overridden using standard techniques.
Merge request reports
Activity
mentioned in issue #78 (closed)
Thanks @alepper. Can you guide me through the changes and their rationale?
No problem.
We should adhere to any affinity which is applied before we begin. Principle of least surprise, and also allows a user to exert control a) if they know better for some particular machine or b) if they want to experiment with the effect of affinity itself. Within those constraints, we want to do a reasonable job allocating runners to logical processors. At the moment, this is particularly with hardware multithreads and NUMA in mind. Without these changes, we ignore any affinity set by the user or the MPI implementation - we run with their privileges, so we can override anything they've set.
To fix this, this change limits the list of available processors for runners (the `cpuid` array) to only those which were in our affinity mask when we first ran. This is the loop at engine.c lines 2024 to 2030. However, there's a catch: we 'first touch' most of our memory (and hence allocate it local to the NUMA node) before `engine_init` is called. We would prefer to allocate runners on this same NUMA node [1]. We pin the main thread at entry (before we touch much memory) so that we can figure out a 'home' (engine.c line 2038) from which we can sort by NUMA distance. Pinning the main thread discards all but one bit of the original affinity mask, so the new `engine_pin` stores this for later use via the `cpuid` array as discussed above. `engine_entry_affinity` ensures the same `engine_init` code works whether main.c calls `engine_pin` or not. I remembered a suggestion that a pinned main thread sometimes caused performance problems, so the main thread's affinity is reset (i.e. it is unpinned) at engine.c line 2067, once we're finished thinking about NUMA.

At the moment, we shuffle (engine.c lines 1990 to 1998) the list of processors we'll allocate to runners. This change removes that, so we begin work with the list of processors in order. This is partly because I'm not sure what the intent of the current code is, which makes it difficult to produce something which achieves the same aim, and partly because I think the shuffle makes our life harder. For example, this shuffle caused problems for you at first on SuperMUC. As I say, if this is trying to e.g. minimise sharing of lower-level cache, I think hwloc would be a better approach.
Without the shuffle, we also don't need to handle hyperthreads directly. Logical processors are already numbered in a sensible way, which previously had to be re-incorporated as part of the sort order; this change just preserves the original order.
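Going back to the entry-mask restriction above, here is a minimal sketch of the idea, not the engine.c implementation: record the affinity mask we were started with and filter the `cpuid` array against it. The helper names are illustrative only.

```c
/* Sketch only: record the entry affinity mask and restrict the cpuid
 * array to it. Not the actual engine.c code or names. */
#define _GNU_SOURCE
#include <sched.h>

static cpu_set_t entry_affinity;

/* Record the mask once, at entry, before anything re-pins the main thread. */
void record_entry_affinity(void) {
  sched_getaffinity(0, sizeof(entry_affinity), &entry_affinity);
}

/* Filter the cpuid array down to the cpus allowed at entry; returns the
 * new count. */
int restrict_cpuid_to_entry_affinity(int *cpuid, int nr_cpus) {
  int kept = 0;
  for (int i = 0; i < nr_cpus; i++)
    if (CPU_ISSET(cpuid[i], &entry_affinity)) cpuid[kept++] = cpuid[i];
  return kept;
}
```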
The change also has each MPI rank - not just rank 0 - print its affinity at entry, and the list of logical processors to which it will allocate runners. The first of these may be useful enough to keep around (as a user, it's nice to check that your controls are having an effect) but the second is just to catch any problems in the near future.
[1] It might be better to change the allocation and initialisation instead, but I think that is likely to be more intrusive. When you're filling the machine, there shouldn't be any difference.
What remains would be to either simplify further by removing the NUMA part, making this the user's problem, or to communicate amongst ranks so we also always do the right thing when the affinity they start with overlaps, e.g. when neither the user nor the MPI implementation changes it from the default.
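To give a feel for that second option, here is one possible shape for it, assuming MPI-3's MPI_Comm_split_type and a simple round-robin deal of the shared mask. This is only a sketch of the idea, not code from this branch; a real version would first check that the ranks' masks actually overlap, and the function name is illustrative only.

```c
/* Sketch: if the ranks on a node all start with the same (overlapping)
 * mask, deal its cpus out round-robin so each rank pins itself to a
 * disjoint share. Assumes MPI_Init has already been called. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>

void split_overlapping_affinity(void) {
  /* Group the ranks that share this node. */
  MPI_Comm node;
  int node_rank, node_size;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node);
  MPI_Comm_rank(node, &node_rank);
  MPI_Comm_size(node, &node_size);

  cpu_set_t mask, share;
  sched_getaffinity(0, sizeof(mask), &mask);
  CPU_ZERO(&share);

  /* Deal the cpus in our entry mask round-robin over the node-local ranks. */
  int k = 0;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, &mask)) continue;
    if (k % node_size == node_rank) CPU_SET(cpu, &share);
    k++;
  }
  if (CPU_COUNT(&share) > 0) sched_setaffinity(0, sizeof(share), &share);

  MPI_Comm_free(&node);
}
```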
Ok, so my tests seemed to have the correct behaviour. I'll hand over to @pdraper as he has more experience playing with this than me.
Reassigned to @pdraper
Added 296 commits:
- ab0224b1...cbefe349 - 295 commits from branch master
- 5db7cd91 - Merge branch 'master' into mpi_and_ht_affinity
Added 1 commit:
- b16424e9 - Post-merge fixes
Added 1 commit:
- b484ebee - Removed dependency on _GNU_SOURCE
Added 1 commit:
- 9a668d20 - Documentation of the extra functions.
Ok, I have merged master into this branch. I have also removed the need for _GNU_SOURCE; it was odd to use 'bool' types in that one and only function.
For me, this is ready to go. Over to you @pdraper.
Added 1 commit:
- c34f643c - Restore _GNU_SOURCE required by sched_getcpu()
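For context, sched_getcpu() is a GNU extension: glibc only declares it in <sched.h> when _GNU_SOURCE is defined before the first include, which is why the macro had to come back. A minimal illustration:

```c
/* sched_getcpu() is only declared by glibc when _GNU_SOURCE is defined
 * before <sched.h> is included. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  printf("main thread currently on cpu %d\n", sched_getcpu());
  return 0;
}
```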
Looking at the result of these changes on COSMA I'm not sure they are what I would expect, namely that the work is spread equally across all CPUs. What seems to happen is that all the runners end up on the same CPU.
To make this concrete: submitting a 12 core job to COSMA4, we see runners on cores 0-7 and 12-17, which, as the output from cpuinfo reveals:

===== Placement on packages =====
Package Id.  Core Id.        Processors
0            0,1,2,8,9,10    (0,12)(1,13)(2,14)(3,15)(4,16)(5,17)
1            0,1,2,8,9,10    (6,18)(7,19)(8,20)(9,21)(10,22)(11,23)

are all on CPU 0, while CPU 1 is idle. Previously we ran on cores 0-11 (and yes, it ran more quickly).
Using default settings for Intel MPI here.
Added 199 commits:
- c34f643c...e5a9a253 - 197 commits from branch master
- 5cc8a4d0 - Merge branch 'master' into mpi_and_ht_affinity
- 74fe8fc8 - Output cpuid assignments when verbose
I've changed this to a WIP as I've created a fork of this in the branch `affinity-fixes`. The main change is to sort the cpuids into NUMA order, so the map picks a core from numa1, then one from numa2, etc., until all the available cores are used. That makes sure you pick hyperthreads last, and it seems to work well with Intel MPI when you have multiple ranks per node.
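Roughly, that sorting looks like the following sketch using libnuma's numa_node_of_cpu(); it is not the code in the `affinity-fixes` branch, just the round-robin-over-nodes idea. Because logical cpus are normally numbered with all physical cores before their hyperthread siblings, keeping ascending order within each node pushes the siblings to the end of the list.

```c
/* Sketch of sorting cpuids into NUMA order: take one core from each node
 * in turn until the list is exhausted. Link with -lnuma. */
#include <numa.h>
#include <stdlib.h>
#include <string.h>

static void interleave_by_numa(int *cpuid, int nr_cpus) {
  if (numa_available() < 0) return; /* no NUMA information: keep the order */

  const int maxnode = numa_max_node();
  int *sorted = malloc(nr_cpus * sizeof(int));
  int *used = calloc(nr_cpus, sizeof(int));
  int k = 0;

  /* Sweep the nodes repeatedly, taking the next unused cpu from each. */
  while (k < nr_cpus) {
    int found = 0;
    for (int node = 0; node <= maxnode && k < nr_cpus; node++) {
      for (int i = 0; i < nr_cpus; i++) {
        if (!used[i] && numa_node_of_cpu(cpuid[i]) == node) {
          used[i] = 1;
          sorted[k++] = cpuid[i];
          found = 1;
          break;
        }
      }
    }
    if (!found) break; /* nothing placeable left in a full sweep */
  }
  /* Append any cpus whose node could not be determined. */
  for (int i = 0; i < nr_cpus; i++)
    if (!used[i]) sorted[k++] = cpuid[i];

  memcpy(cpuid, sorted, nr_cpus * sizeof(int));
  free(sorted);
  free(used);
}
```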
PLATFORM MPI is still not working and I suspect it will remain so, so I've also added a command-line option to disable affinity, regardless of the engine policy.
Could we try this out on non-COSMA machines?
I tried an AMD 48 core machine and that worked, so James will look at (secret unknown machine with many cores).
@jwillis could you add your findings here once run?
The changes that @pdraper made in the affinity_fixes branch seem to have worked: the code scales better with affinity enabled, compared with my old run with master that had affinity turned off.
Added 27 commits:
- 74fe8fc8...145f97b5 - 25 commits from branch master
- e64712c9 - Change affinity so that cores from different numa nodes are selected
- 3d5f0127 - Merge branch 'master' into mpi_and_ht_affinity
Changing to @matthieu since this is now a major modification by me.
Reassigned to @matthieu
Added 1 commit:
- e48bdf1c - Disable processor affinity by default, this is less surprising more often
mentioned in commit 51cb767b