Simplify, permit user control over affinity

Merged Ghost User requested to merge mpi_and_ht_affinity into master

This ensures we 'do the right thing' when the user imposes affinity through e.g. Intel MPI's I_MPI_PIN_DOMAIN or other mechanisms. Also, we no longer work with a shuffled cpuid array, which will hopefully have less surprising failure modes, and means we don't have to handle hyperthreading explicitly. If we want to be clever in this way, e.g. to maximise available cache, then we could replace libnuma with hwloc as discussed elsewhere.

Hopefully this gives us a reasonable affinity which is easily overridden using standard techniques.

  • Ghost User mentioned in issue #78 (closed)

  • Thanks @alepper. Can you guide me through the changes and their rationale?

    Edited by Matthieu Schaller
  • Author Contributor

    No problem.

    We should adhere to any affinity which is applied before we begin. This follows the principle of least surprise, and it also allows a user to exert control: (a) if they know better for some particular machine, or (b) if they want to experiment with the effect of affinity itself. Within those constraints, we want to do a reasonable job of allocating runners to logical processors; at the moment, that is particularly with hardware multithreads and NUMA in mind. Without these changes, we ignore any affinity set by the user or the MPI implementation - since we run with their privileges, we can override anything they've set.

    To fix this, the change limits the list of processors available to runners (the cpuid array) to only those which were in our affinity mask when we first ran. This is the loop at engine.c lines 2024 to 2030. However, there's a catch: we 'first touch' most of our memory (and hence allocate it local to a particular NUMA node) before engine_init is called. We would prefer to allocate runners on that same NUMA node [1].
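    In case it helps, a minimal sketch of that kind of filtering, assuming glibc's CPU_SET macros; the names (restrict_to_entry_affinity, cpuid, nr_cores) are illustrative, not the actual engine.c code:

    /* Sketch only: keep in cpuid[] only the logical processors present in
     * the affinity mask inherited at entry; returns how many remain. */
    #define _GNU_SOURCE
    #include <sched.h>

    static int restrict_to_entry_affinity(int *cpuid, int nr_cores) {
      cpu_set_t entry_mask;
      CPU_ZERO(&entry_mask);
      if (sched_getaffinity(0, sizeof(entry_mask), &entry_mask) != 0)
        return nr_cores; /* on error, leave the list untouched */
      int kept = 0;
      for (int k = 0; k < nr_cores; k++)
        if (CPU_ISSET(cpuid[k], &entry_mask)) cpuid[kept++] = cpuid[k];
      return kept; /* number of processors we are actually allowed to use */
    }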

    We pin the main thread at entry (before we touch much memory) so that we can figure out a 'home' (engine.c line 2038) from which we can sort by NUMA distance. Pinning the main thread discards all but one bit of the original affinity mask, so the new engine_pin stores this for later use via the cpuid array as discussed above. engine_entry_affinity ensures the same engine_init code works whether main.c calls engine_pin or not. I remembered a suggestion that a pinned main thread sometimes caused performance problems, so the main thread's affinity is reset (i.e. it is unpinned) at engine.c line 2067, once we're finished thinking about NUMA.
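    For reference, the pin/unpin dance might look roughly like this (a sketch under the same assumptions as above, not the actual engine_pin implementation):

    /* Sketch only: pin the calling thread to one CPU from its current mask,
     * remembering the original mask so it can be restored (unpinned) later. */
    #define _GNU_SOURCE
    #include <sched.h>

    static cpu_set_t entry_affinity; /* affinity mask saved at entry */

    static int pin_to_home_cpu(void) {
      CPU_ZERO(&entry_affinity);
      sched_getaffinity(0, sizeof(entry_affinity), &entry_affinity);
      for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &entry_affinity)) {
          cpu_set_t one;
          CPU_ZERO(&one);
          CPU_SET(cpu, &one);
          sched_setaffinity(0, sizeof(one), &one); /* pin to a single CPU */
          return cpu; /* the 'home' CPU used to sort by NUMA distance */
        }
      }
      return -1; /* empty mask: nothing to pin to */
    }

    static void unpin_main_thread(void) {
      /* restore the mask we saved at entry once NUMA decisions are done */
      sched_setaffinity(0, sizeof(entry_affinity), &entry_affinity);
    }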

    At the moment, we shuffle (engine.c lines 1990 to 1998) the list of processors we'll allocate to runners. This change removes the shuffle, so we begin work with the list of processors in order. I did this partly because I'm not sure what the intent of the current code is, which makes it difficult to produce something which achieves the same aim, and partly because I think the shuffle makes our life harder. For example, the shuffle caused problems for you at first on SuperMUC. As I say, if it is trying to e.g. minimise sharing of lower-level cache, I think hwloc would be a better approach.

    Without the shuffle, we don't need to handle hyperthreads directly: logical processors are already numbered in a sensible way, which previously we had to incorporate into the sort order. This change just preserves the original order.

    The change also has each MPI rank - not just rank 0 - print its affinity at entry, and the list of logical processors to which it will allocate runners. The first of these may be useful enough to keep around (as a user, it's nice to check that your controls are having an effect) but the second is just to catch any problems in the near future.
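    A sketch of how such a per-rank report can be produced (illustrative only; print_entry_affinity is a made-up name, and the rank/time prefix visible in the output quoted later in this thread comes from the code's own logging):

    /* Sketch only: render the current affinity mask as a string of 0/1
     * flags, one character per online logical processor, similar to the
     * 'Affinity at entry' lines shown further down this thread. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static void print_entry_affinity(void) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return;
      long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
      if (ncpu > CPU_SETSIZE) ncpu = CPU_SETSIZE;
      char buf[CPU_SETSIZE + 1];
      for (long i = 0; i < ncpu; i++) buf[i] = CPU_ISSET(i, &mask) ? '1' : '0';
      buf[ncpu] = '\0';
      printf("Affinity at entry: %s\n", buf);
    }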

    [1] It might be better to change the allocation and initialisation instead, but I think that is likely to be more intrusive. When you're filling the machine, there shouldn't be any difference.

  • Author Contributor

    What remains would be either to simplify further by removing the NUMA part, making it the user's problem, or to communicate amongst ranks so that we also always do the right thing when the affinity masks they start with overlap, e.g. when neither the user nor the MPI implementation changes the default.

  • Wow, thanks. Much more than I expected. I'll try it out on different machines.

  • Ok, so my tests seemed to show the correct behaviour. I'll hand over to @pdraper as he has more experience playing with this than me.

  • Reassigned to @pdraper

  • Note that I tried this both on SuperMUC and Solomon (the machine in Ostrava). The latter does not have hyperthreading. In both cases, top did not show any strange behaviour.

  • Matthieu Schaller Added 296 commits:

  • Added 1 commit:

    • b484ebee - Removed dependency on _GNU_SOURCE
  • Added 1 commit:

    • 9a668d20 - Documentation of the extra functions.
  • Ok, I have merged master into this branch. I have also removed the need for _GNU_SOURCE. It is odd to use 'bool' types in just that one function.

    For me, this is ready to go. Over to you @pdraper.

  • Author Contributor

    _GNU_SOURCE is required for sched_getcpu, not bool. I'd suggest restoring that part of the #ifdef before merge.
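    For context, sched_getcpu() is a GNU extension, so the feature-test macro has to be defined before the first system header is included, roughly like this:

    #define _GNU_SOURCE /* must come before any system header */
    #include <sched.h>

    int current_cpu(void) {
      /* sched_getcpu() is the GNU extension that needs _GNU_SOURCE */
      return sched_getcpu();
    }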

  • Arrrgggh! You are right. I foolishly assumed this was part of POSIX.

  • Added 1 commit:

    • c34f643c - Restore _GNU_SOURCE required by sched_getcpu()
  • Looking at the result of these changes on COSMA, I'm not sure they are what I would expect, namely that the work is spread equally across all CPUs. What seems to happen is that all the runners end up on the same CPU.

    To make this concrete, submitting a 12-core job to COSMA4 we see runners on cores 0-7 and 12-17. As the output from cpuinfo reveals:

    =====  Placement on packages  =====
    Package Id.     Core Id.        Processors
    0               0,1,2,8,9,10            (0,12)(1,13)(2,14)(3,15)(4,16)(5,17)
    1               0,1,2,8,9,10            (6,18)(7,19)(8,20)(9,21)(10,22)(11,23)

    these are all on CPU 0, while CPU 1 is idle. Previously we ran on cores 0-11 (and yes, it ran more quickly).

    Using default settings for Intel MPI here.

  • Author Contributor

    That's unexpected. I don't remember whether I used Intel or Platform MPI, but I first built this on COSMA5. Would you be able to run with verbose mode on, and share the output lines containing 'Affinity at entry' or 'cpu map is'? Thanks a lot!

  • Hi Angus, sure, here they are:

    [0000] [00000.5] engine_init: Affinity at entry: 111111111111111111111111
    [0000] [00000.5] engine_init: prefer NUMA-local CPUs
    [0000] [00000.5] engine_init: cpu map is [ 0 1 2 3 4 5 12 13 14 15 16 17 6 7 8 9 10 11 18 19 20 21 22 23 ].
  • BTW, I get exactly the same if I use PLATFORM MPI.

  • Peter W. Draper Added 199 commits:

  • Going a bit stale, so I've brought it up to date with master.

  • Peter W. Draper Title changed from Simplify, permit user control over affinity to WIP:Simplify, permit user control over affinity

  • I've changed this to a WIP as I've created a fork of this in the branch affinity-fixes.

    The main change is to sort the cpuids into NUMA order, so the map picks a core from NUMA node 1, then one from NUMA node 2, etc., until all the available cores are used. That makes sure you pick hyperthreads last, and it seems to work well with Intel MPI when you have multiple ranks per node.
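    A rough sketch of that interleaving using libnuma's numa_node_of_cpu(); the names are hypothetical and this is not the actual affinity-fixes code:

    /* Sketch only: reorder cpuid[0..n-1] by cycling over NUMA nodes, taking
     * one CPU from each node in turn. Since the OS numbers physical cores
     * before their sibling hardware threads, hyperthreads end up last. */
    #include <stdlib.h>
    #include <string.h>
    #include <numa.h>

    static void numa_interleave(int *cpuid, int n) {
      const int max_node = numa_max_node();
      int *out = malloc(n * sizeof(int));
      char *used = calloc(n, 1);
      if (out == NULL || used == NULL) { free(out); free(used); return; }
      int count = 0;
      while (count < n) {
        int progressed = 0;
        for (int node = 0; node <= max_node && count < n; node++) {
          for (int k = 0; k < n; k++) {
            if (!used[k] && numa_node_of_cpu(cpuid[k]) == node) {
              out[count++] = cpuid[k];
              used[k] = 1;
              progressed = 1;
              break; /* one CPU from this node, then move to the next node */
            }
          }
        }
        if (!progressed) { /* CPUs on unknown nodes: append in original order */
          for (int k = 0; k < n; k++)
            if (!used[k]) out[count++] = cpuid[k];
        }
      }
      memcpy(cpuid, out, n * sizeof(int));
      free(out);
      free(used);
    }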

    PLATFORM MPI is still not working and I suspect it will remain so, so I've also added a command-line option to disable affinity, regardless of the engine policy.

    Could we try this out on non-COSMA machines?

  • Do you have a specific machine in mind? Any specification that would help?

  • I tried an AMD 48 core machine and that worked, so James will look at (secret unknown machine with many cores).

    Edited by Matthieu Schaller
  • @jwillis could you add your findings here once run?

  • Affinity_Fixes_Results

    Edited by James Willis
  • The changes that @pdraper made in the affinity_fixes branch seem to have worked, as the code scales better with affinity enabled, compared with my old run on master that had affinity turned off.

  • Thanks, that looks promising then. I'll update this merge request with these fixes and we'll accept that for now.

  • Peter W. Draper Added 27 commits:

    • 74fe8fc8...145f97b5 - 25 commits from branch master
    • e64712c9 - Change affinity so that cores from different numa nodes are selected
    • 3d5f0127 - Merge branch 'master' into mpi_and_ht_affinity
  • Changing to @matthieu since this is now a major modification by me.

  • Reassigned to @matthieu

  • Peter W. Draper Title changed from WIP:Simplify, permit user control over affinity to Simplify, permit user control over affinity

  • Ok, I'll have a look.

  • Everything works. Just two comments:

    • Do we need to read a parameter after -a from the command line? Isn't this a straight logical switch, as -s and -g are?
    • Could we add the new parameter to the README as well?

    Happy to do these changes if you want.

  • I used -a 1|0 because -a 1 (affinity on) is the default; otherwise -a doesn't work as a mnemonic. Personally I'm attracted to having -a 0 as the default, since that generally gives less surprising results, in which case a simple -a would work.

    Think that is a good idea?

  • I agree, we should have no treatment of affinity as the default.
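    For what it's worth, with affinity off by default the option reduces to a plain flag; a minimal getopt sketch, illustrative only and not the actual main.c option handling:

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
      bool with_affinity = false; /* affinity handling off unless -a is given */
      int c;
      while ((c = getopt(argc, argv, "asg")) != -1) {
        switch (c) {
          case 'a': with_affinity = true; break; /* plain switch, no argument */
          case 's': /* other existing switch, for illustration */ break;
          case 'g': /* other existing switch, for illustration */ break;
          default: return 1;
        }
      }
      printf("affinity %s\n", with_affinity ? "on" : "off");
      return 0;
    }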

  • Peter W. Draper Added 1 commit:

    • e48bdf1c - Disable processor affinity by default, this is less surprising more often
  • Good, I've changed to just -a and processor affinity is off by default. Also updated the README.

  • Thanks. Accepting now.

  • Matthieu Schaller Status changed to merged

  • mentioned in commit 51cb767b
