
Add NUMA interleave of memory allocations

Merged Peter W. Draper requested to merge numa-interleave into master

Adds the option to interleave memory allocations uniformly across the NUMA regions which are allowed by the CPU affinity mask.

This seems to help the threadpool when running EAGLE_50 on a single node of COSMA8; see #760.
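
For reference, a minimal sketch of the general approach, assuming libnuma: build a node mask from the CPU affinity mask and request page interleaving over those nodes. This is illustration only, not the implementation in this branch, and the helper name interleave_over_allowed_nodes is hypothetical.

```c
#include <numa.h>
#include <stdio.h>

/* Hypothetical helper: page-interleave this thread's future allocations
 * over the NUMA nodes that host the CPUs in its affinity mask. */
static void interleave_over_allowed_nodes(void) {
  if (numa_available() < 0) return; /* No NUMA support on this system. */

  /* The CPUs we are allowed to run on. */
  struct bitmask *cpus = numa_allocate_cpumask();
  if (numa_sched_getaffinity(0, cpus) < 0) {
    numa_bitmask_free(cpus);
    return;
  }

  /* Collect the NUMA node of every allowed CPU. */
  struct bitmask *nodes = numa_allocate_nodemask();
  for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
    if (numa_bitmask_isbitset(cpus, cpu)) {
      const int node = numa_node_of_cpu(cpu);
      if (node >= 0) numa_bitmask_setbit(nodes, (unsigned int)node);
    }
  }

  /* Interleave subsequent allocations across those nodes. */
  numa_set_interleave_mask(nodes);

  numa_bitmask_free(nodes);
  numa_bitmask_free(cpus);
}

int main(void) {
  interleave_over_allowed_nodes();
  printf("Highest NUMA node: %d\n", numa_max_node());
  return 0;
}
```

Build with -lnuma. The policy applies to the calling thread and is inherited by threads created afterwards, so it would need to be set before the worker threads are spawned (or in each thread).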

Edited by Peter W. Draper

Activity

  • Peter W. Draper changed the description


  • @matthieu using this I don't need to limit the number of threadpool threads to see good performance. It would be interesting to see if you can confirm that.

  • Oh exciting! Any setup you'd think would particularly benefit from this change?

  • Still early days but looks very promising if confirmed.

    EAGLE-25 (the top-left label is wrong) with full physics switched on, running on COSMA7.

    • orange/blue -> master
    • red/green -> This branch with --interleave at runtime
    • orange/red -> 2 nodes with 2 MPI ranks/node
    • green/blue -> 1 node, no MPI

    [plot: runtime_a_May]

    Edited by Matthieu Schaller
  • Thanks, that looks better than I hoped, and it confirms that everything is working, since the runs with 2 MPI ranks per node should gain no benefit from this.

    Re: the threadpool thread count. I just thought you had some tests from when you added the code to control that count as well as the number of runners. Clearly that was only for COSMA8.

  • Playing around on COSMA6, it looks like there is a problem with using all of the NUMA regions, possibly an off-by-one issue; a sketch of the classic pitfall is below.
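
    Purely a speculative illustration of the classic pitfall, not a diagnosis of the actual bug: with libnuma, numa_max_node() returns the highest node number (not a count), so an exclusive loop bound silently skips the last NUMA region.

    ```c
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
      if (numa_available() < 0) return 1;

      /* Highest node number, *inclusive*: a two-NUMA-node box reports 1, not 2. */
      const int max_node = numa_max_node();

      /* Wrong: "node < max_node" skips the last NUMA region; use an inclusive bound. */
      for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        const long long size = numa_node_size64(node, &free_bytes);
        printf("node %d: %lld bytes total, %lld free\n", node, size, free_bytes);
      }
      return 0;
    }
    ```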

  • Could there be a difference in the way regions are numbered on different architectures?

  • Here is the same set of runs as above. It definitely looks good.

    The additional line (purple) is 2 nodes with 1 MPI rank / node. So it looks like using 1 rank / NUMA region is still better even when interleaving.

    [plot: runtime_a_May]

  • I'm currently confused as to why you are seeing this effect on COSMA7, since all my tests show the memory not being interleaved: the last NUMA node is not being used (it was OK on COSMA8, where we had plenty).

    I have another implementation that seems to work as expected (use the command numastat -m to see how the memory is distributed); it is in the branch numa-interleave-2. A small programmatic cross-check is sketched below.
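
    In addition to numastat -m, one can also query the interleave mask the thread actually ended up with; a minimal sketch assuming libnuma (not part of either branch):

    ```c
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
      if (numa_available() < 0) return 1;

      /* Empty unless the current policy of this thread is page interleaving. */
      struct bitmask *mask = numa_get_interleave_mask();

      for (int node = 0; node <= numa_max_node(); node++)
        printf("node %d: %s\n", node,
               numa_bitmask_isbitset(mask, node) ? "in interleave mask" : "not used");

      numa_bitmask_free(mask);
      return 0;
    }
    ```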

  • I am also very confused by what happened to the "orange" run here. There is a clear change of performance after it stopped and resumed on a fresh set of nodes. Nothing to do with this branch, but it makes comparisons hard.

    [plot: runtime_a_May]

  • The two sets of "orange" nodes look OK, so it doesn't look like a hardware fault. The effect is very real, as we can see from the time-per-timestep plots:

    [plot: orange.norm]

    [plot: orange.log]

    It must be an MPI or balance issue, as engine_collect_end_of_timesteps goes from order <10 ms to <30 ms. The jump in unskip times from ~35 to ~75 ms also suggests MPI, since I see we are not repartitioning, so the per-node content is pretty much fixed and the CPU balance cannot change?

    BTW, neither set of nodes was on the same switch, the fabric looks in good health, and the "ping" times between the nodes are very similar, so it is not obvious why that should matter. Maybe the second run was just unfortunate and was sharing bandwidth with a heavy user; it is not possible to check that.

    Edited by Peter W. Draper
  • But it seems everything is slower:

    [plot: orange.work.log.log]

    Must be something else going on here.

  • That is very, very weird. unskip, marktasks, and the scheduler reweight are all unrelated to MPI. Have they changed?

  • I am on m7324 where the job was running. /proc/cpuinfo reports that all the cores are running at 2.2 GHz. Is turbo-boost switched off?

    Edited by Matthieu Schaller
  • Turbo-boost etc. all look fine and are set the same across these four nodes, so that is not the obvious cause.

    Re: timings of the non-MPI bits. Actually they look reasonable when taken in context:

    engine_reweight:

    [plot: orange.reweight]

    engine_marktasks:

    [plot: orange.marktasks]

    Whereas engine_prepare:

    [plot: orange.prepare]

    Axes are step number versus time taken in ms.

    Edited by Peter W. Draper
  • Plotting engine_launch timesteps and tasks gives similar plots, so the time is being lost in MPI.

  • The "brown" run has now completed.

    It went from m7205-m7210 to m7194-m7205 when restarting.

    [plot: runtime_a_May]

  • So... topology matters way way more than I expected. Still can't quite believe it...

    We can move that discussion to another thread. But that means we should invest more into making sure adjacent domains are on adjacent CPUs/nodes/switches. And I should make sure I use the --contiguous option.

  • Just chiming in from the sidelines here :)

    First of all, these are awesome results!

    Secondly, topology can matter a lot when there is congestion, e.g. the probability of losing a packet grows exponentially with the number of network hops. Also, the latency per hop will depend on how much traffic there is at the switch.

    Note that while these issues may only affect a small fraction of all packets sent back and forth, if they block for several milliseconds, it will cause stragglers and thus affect the entire simulation.

    Peter, do you have any data on packet loss or latency over the switches as a function of network load? Or can things like packet loss and latency be measured at the switches at runtime (i.e. on the switch itself, outside of swift)?

    Cheers, Pedro
