
Add NUMA interleave of memory allocations

Merged Peter W. Draper requested to merge numa-interleave into master

Adds the option to interleave memory allocations uniformly across the NUMA regions which are allowed by the CPU affinity mask.

This seems to help the threadpool when running EAGLE_50 on a single node of COSMA8; see #760.
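
For reference, a minimal sketch of the general approach, assuming libnuma: build a node mask from the CPU affinity mask and request page interleaving over those nodes. This is illustration only, not the implementation in this branch, and the helper name interleave_over_allowed_nodes is hypothetical.

```c
#include <numa.h>
#include <stdio.h>

/* Hypothetical helper: page-interleave this thread's future allocations
 * over the NUMA nodes that host the CPUs in its affinity mask. */
static void interleave_over_allowed_nodes(void) {
  if (numa_available() < 0) return; /* No NUMA support on this system. */

  /* The CPUs we are allowed to run on. */
  struct bitmask *cpus = numa_allocate_cpumask();
  if (numa_sched_getaffinity(0, cpus) < 0) {
    numa_bitmask_free(cpus);
    return;
  }

  /* Collect the NUMA node of every allowed CPU. */
  struct bitmask *nodes = numa_allocate_nodemask();
  for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
    if (numa_bitmask_isbitset(cpus, cpu)) {
      const int node = numa_node_of_cpu(cpu);
      if (node >= 0) numa_bitmask_setbit(nodes, (unsigned int)node);
    }
  }

  /* Interleave subsequent allocations across those nodes. */
  numa_set_interleave_mask(nodes);

  numa_bitmask_free(nodes);
  numa_bitmask_free(cpus);
}

int main(void) {
  interleave_over_allowed_nodes();
  printf("Highest NUMA node: %d\n", numa_max_node());
  return 0;
}
```

Build with -lnuma. The policy applies to the calling thread and is inherited by threads created afterwards, so it would need to be set before the worker threads are spawned (or in each thread).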

Edited by Peter W. Draper

Activity

  • Peter W. Draper changed the description


  • @matthieu using this I don't need to limit the number of threadpool threads to see good performance. It would be interesting to see if you can confirm that.

  • Oh exciting! Any setup you'd think would particularly benefit from this change?

  • Still early days but looks very promising if confirmed.

    EAGLE-25 (the top-left label is wrong) with full physics switched on, running on COSMA7.

    • orange/blue -> master
    • red/green -> This branch with --interleave at runtime
    • orange/red -> 2 nodes with 2 MPI ranks/node
    • green/blue -> 1 node, no MPI

    [plot: runtime_a_May]

    Edited by Matthieu Schaller
  • Thanks, that looks better than I hoped, and it confirms that everything is working, since the runs with 2 MPI ranks per node should gain no benefit from this.

    Re: the threadpool thread count. I just thought you had some tests from when you added the code to control that count as well as the number of runners. Clearly that was only for COSMA8.

  • Playing around on COSMA6, it looks like there is a problem with using all of the NUMA regions, possibly an off-by-one issue; a sketch of the classic pitfall is below.
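
    Purely a speculative illustration of the classic pitfall, not a diagnosis of the actual bug: with libnuma, numa_max_node() returns the highest node number (not a count), so an exclusive loop bound silently skips the last NUMA region.

    ```c
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
      if (numa_available() < 0) return 1;

      /* Highest node number, *inclusive*: a two-NUMA-node box reports 1, not 2. */
      const int max_node = numa_max_node();

      /* Wrong: "node < max_node" skips the last NUMA region; use an inclusive bound. */
      for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        const long long size = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld bytes total, %lld free\n", node, size, free_bytes);
      }
      return 0;
    }
    ```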

  • Could there be a difference in the way regions are numbered on different architectures?

  • Here is the same set of runs as above. It definitely looks good.

    The additional line (purple) is 2 nodes with 1 MPI rank / node. So it looks like using 1 rank / NUMA region is still better even when interleaving.

    [plot: runtime_a_May]

  • I'm currently confused as to why you are seeing this effect on COSMA7, since all my tests show the memory not being interleaved: the last NUMA node is not being used (it was OK on COSMA8, where we had plenty).

    I have another implementation that seems to work as expected (use the command numastat -m to see how the memory is distributed); it is in the branch numa-interleave-2. A small programmatic cross-check is sketched below.
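
    In addition to numastat -m, one can also query the interleave mask the thread actually ended up with; a minimal sketch assuming libnuma (not part of either branch):

    ```c
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
      if (numa_available() < 0) return 1;

      /* Empty unless the current policy of this thread is page interleaving. */
      struct bitmask *mask = numa_get_interleave_mask();

      for (int node = 0; node <= numa_max_node(); node++)
        printf("node %d: %s\n", node,
               numa_bitmask_isbitset(mask, node) ? "in interleave mask" : "not used");

      numa_bitmask_free(mask);
      return 0;
    }
    ```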

  • I am also very confused by what happened to the "orange" run here. There is a clear change of performance after it stopped and resumed on a fresh set of nodes. Nothing to do with this branch, but it makes comparisons hard.

    [plot: runtime_a_May]

  • The two sets of "orange" nodes look OK, so it doesn't look like a hardware fault. The effect is very real, as we can see from the time-per-timestep plots:

    [plot: orange.norm]

    [plot: orange.log]

    It must be an MPI or balance issue, as engine_collect_end_of_timesteps goes from order <10 ms to <30 ms. The jump in unskip times from ~35 to ~75 ms also suggests MPI, since I see we are not repartitioning, so the per-node content is pretty much fixed and the CPU balance cannot change?

    BTW, neither set of nodes was on the same switch, the fabric looks in good health, and the "ping" times between the nodes are very similar, so it is not obvious why that should matter. Maybe the second run was just unfortunate and was sharing bandwidth with a heavy user; it is not possible to check that.

    Edited by Peter W. Draper
  • But it seems everything is slower:

    [plot: orange.work.log.log]

    Must be something else going on here.

  • That is very, very weird. unskip, marktasks, and the scheduler reweight are all unrelated to MPI. Have they changed?

  • I am on m7324 where the job was running. /proc/cpuinfo reports that all the cores are running at 2.2 GHz. Is turbo-boost switched off?

    Edited by Matthieu Schaller
  • Turbo-boost etc. all look fine and are set the same across these four nodes, so that is not the obvious cause.

    Re: timings of the non-MPI bits. Actually they look reasonable when taken in context:

    engine_reweight:

    [plot: orange.reweight]

    engine_marktasks:

    [plot: orange.marktasks]

    Whereas engine_prepare:

    [plot: orange.prepare]

    Axes are step number versus time taken in ms.

    Edited by Peter W. Draper
  • Plotting engine_launch timesteps and tasks gives similar plots, so the time is being lost in MPI.

  • The "brown" run has now completed.

    It went from m7205-m7210 to m7194-m7205 when restarting.

    [plot: runtime_a_May]

  • So... topology matters way way more than I expected. Still can't quite believe it...

    We can move that discussion to another thread. But that means we should invest more into making sure adjacent domains are on adjacent CPUs/nodes/switches. And I should make sure I use the --contiguous option.

  • Just chiming in from the sidelines here :)

    First of all, these are awesome results!

    Secondly, topology can matter a lot when there is congestion, e.g. the probability of losing a packet grows exponentially with the number of network hops. Also, the latency per hop will depend on how much traffic there is at the switch.

    Note that while these issues may only affect a small fraction of all packets sent back and forth, if they block for several milliseconds, it will cause stragglers and thus affect the entire simulation.

    Peter, do you have any data on packet loss or latency over the switches as a function of network load? Or can things like packet loss and latency be measured at the switches at runtime (i.e. on the switch itself, outside of swift)?

    Cheers, Pedro
