Add NUMA interleave of memory allocations
Adds the option to interleave memory allocations uniformly across the NUMA regions which are allowed by the CPU affinity mask.
Seems to help the threadpool when running EAGLE_50 on a single node of COSMA8, see #760.
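For reference, below is a minimal sketch of how such an interleave policy could be set up with libnuma, assuming the node mask is derived from the process CPU affinity. The function name and error handling are made up for illustration and are not necessarily what this branch does.

```c
/* Sketch only: interleave future allocations over the NUMA nodes that
 * host the CPUs in our affinity mask. The function name is hypothetical
 * and this is not necessarily how the branch implements it.
 * Link with -lnuma. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>

static void interleave_over_allowed_nodes(void) {
  /* Bail out if the kernel/library reports no NUMA support. */
  if (numa_available() < 0) return;

  cpu_set_t cpus;
  if (sched_getaffinity(0, sizeof(cpus), &cpus) != 0) return;

  /* Mark every NUMA node that hosts at least one CPU we may run on. */
  struct bitmask *nodes = numa_allocate_nodemask();
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (CPU_ISSET(cpu, &cpus)) {
      const int node = numa_node_of_cpu(cpu);
      if (node >= 0) numa_bitmask_setbit(nodes, node);
    }
  }

  /* Pages of future allocations are now distributed round-robin over
   * the marked nodes. */
  numa_set_interleave_mask(nodes);
  numa_bitmask_free(nodes);
}

int main(void) {
  interleave_over_allowed_nodes();
  /* ... allocate the big simulation arrays after this point ... */
  return 0;
}
```

Calling something like this once at start-up, before the large arrays are allocated, would spread their pages round-robin over the allowed nodes rather than concentrating them on the first-touch node.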
@matthieu using this I don't need to limit the threadpool threads to see good performance. It would be interesting to see whether you confirm that.
Still early days but looks very promising if confirmed.
EAGLE-25 (top-left label is wrong) with full physics switched on, running on COSMA7.

- orange/blue -> master
- red/green -> this branch with `--interleave` at runtime
- orange/red -> 2 nodes with 2 MPI ranks/node
- green/blue -> 1 node, no MPI
Edited by Matthieu Schaller

Thanks, looks better than I hoped, but confirms that it is all working since the 2 MPI ranks should gain no benefit from this.
Re: threadpool thread count. I just thought you had some tests from when you added the code to control that count as well as the runners. Clearly that was only for COSMA8.
I'm currently confused as to why you are seeing this effect on COSMA7: all my tests show the memory not being interleaved, since the last NUMA node is not being used (it is OK on COSMA8, where we had plenty).
Have another implementation that seems to work as expected (use the command `numastat -m` to see how memory is distributed); that is in the branch numa-interleave-2.

The two sets of "orange" nodes look OK, so this doesn't look like a hardware-fault issue. The effect is very real, as we can see from the time per timestep plots:
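As a side note on the `numastat -m` check above: the current policy can also be queried from inside the process via libnuma. A small illustrative example (not part of this branch) might look like this:

```c
/* Illustrative check only: print which NUMA nodes are in the current
 * interleave mask, complementing an external `numastat -m` inspection.
 * Link with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
  if (numa_available() < 0) {
    printf("No NUMA support on this system.\n");
    return 0;
  }

  /* Empty mask means the task's policy is not page interleaving. */
  struct bitmask *mask = numa_get_interleave_mask();
  printf("Interleave mask:");
  for (int node = 0; node <= numa_max_node(); node++)
    if (numa_bitmask_isbitset(mask, node)) printf(" %d", node);
  printf("\n");

  numa_bitmask_free(mask);
  return 0;
}
```

If the policy took effect, the printed mask should list the allowed nodes, and `numastat -m` should show the resident pages spread roughly evenly across them.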
Must be an MPI or balance issue, as `engine_collect_end_of_timesteps` goes from <10 ms to around 30 ms. The jump in unskip times from ~35 to ~75 ms suggests MPI, since I see we are not repartitioning, so the per-node content is pretty much fixed and the CPU balance cannot change?

BTW, neither set of nodes was on the same switch, the fabric looks in good health, and the "ping" times between the nodes are very similar, so it is not obvious why that should matter. Maybe the second run was just unfortunate and was sharing bandwidth with a heavy user; not possible to check that.
Edited by Peter W. Draper

I am on m7324, where the job was running. `/proc/cpuinfo` reports that all the cores are running at 2.2 GHz. Is turbo-boost switched off?

Edited by Matthieu Schaller

Turbo-boost etc. all look fine and are set to the same across these four nodes, so that is not obviously the explanation.
Re: timings of the non-MPI bits. Actually they look reasonable when taken in context:

`engine_reweight`:

`engine_marktasks`:

Whereas `engine_prepare`:

Axes are step number versus time taken in ms.
Edited by Peter W. Draper

Just chiming in from the sidelines here :)
First of all, these are awesome results!
Secondly, topology can matter a lot when there is congestion, e.g. the probability of losing a packet grows exponentially with the number of network hops. Also, the latency per hop will depend on how much traffic there is at the switch.
Note that while these issues may only affect a small fraction of all packets sent back and forth, if those packets block for several milliseconds, they will cause stragglers and thus affect the entire simulation.
Peter, do you have any data on packet loss or latency over the switches as a function of network load? Or can things like packet loss and latency be measured at the switches at runtime (i.e. on the switch itself, outside of `swift`)?

Cheers, Pedro