Network Configuration Effects
Morning all,
I have been running some tests on COSMA-7 to see whether the extra latency introduced by a triple hop (node - switch - top-level switch - switch - node) vs. a single hop (node - switch - node) affects SWIFT. We would hope that it doesn't; otherwise we might be wise to incorporate the network topology into our domain decomposition.
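For a rough sense of scale, here is a back-of-envelope sketch of what the two extra switch traversals might cost. All the numbers (per-switch latency, base MPI latency) are illustrative assumptions, not measurements from COSMA-7:

```python
# Back-of-envelope: extra latency from two additional switch traversals.
# SWITCH_HOP_NS and BASE_MPI_LATENCY_US are assumed, illustrative values.

SWITCH_HOP_NS = 130        # assumed per-switch port-to-port latency (ns)
BASE_MPI_LATENCY_US = 1.5  # assumed small-message MPI latency, single hop (us)

def end_to_end_us(extra_switch_hops: int) -> float:
    """End-to-end latency with a given number of extra switch traversals."""
    return BASE_MPI_LATENCY_US + extra_switch_hops * SWITCH_HOP_NS / 1000.0

single_hop = end_to_end_us(0)  # node - switch - node
triple_hop = end_to_end_us(2)  # node - switch - top-level switch - switch - node

print(f"single hop: {single_hop:.2f} us")   # 1.50 us
print(f"triple hop: {triple_hop:.2f} us")   # 1.76 us
print(f"increase:   {100 * (triple_hop / single_hop - 1):.0f}%")  # ~17%
```

So even a double-digit relative increase in latency is still only a fraction of a microsecond per message, which is the kind of overhead we would hope asynchronous communication can hide entirely.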
Set-up
Here I present a worst-case scenario: SWIFT, using tbbmalloc and Gadget-2 hydro (so we have as little work as possible), running with the grid domain decomposition. This should give us pretty much minimal work with maximal communication.
In terms of particle set-up, I used EAGLE-50r2 (i.e. EAGLE-50 replicated 8 times), with two boxes on each node, for a total of 4 nodes.
SWIFT was invoked with
mpirun -np 8 ../swift_mpi -c -a -s -n 8192 -t 14 eagle_50.yml
as each node on COSMA-7 has two 14-core chips.
Results
Summary: We don't care about latency! Asynchronous communication works.
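A toy model of why this works: with asynchronous (Isend/Irecv-style) communication, transfers overlap with task work, so a step costs roughly max(compute, comm) rather than compute + comm. The numbers below are illustrative assumptions, not SWIFT measurements:

```python
# Toy model of latency hiding. All timings are assumed, illustrative values.

def step_time_blocking(compute_ms: float, comm_ms: float) -> float:
    """Synchronous model: communication serialises with the work."""
    return compute_ms + comm_ms

def step_time_async(compute_ms: float, comm_ms: float) -> float:
    """Asynchronous model: communication fully overlaps with the work."""
    return max(compute_ms, comm_ms)

compute = 10.0           # assumed task work per step (ms)
comm_single = 2.0        # assumed communication time, single hop (ms)
comm_triple = 2.5        # assumed communication time, triple hop (ms)

# Blocking: the extra hop latency shows up directly in the step time.
# Async: as long as comm stays below compute, the step time is unchanged.
for label, comm in (("single hop", comm_single), ("triple hop", comm_triple)):
    print(f"{label}: blocking {step_time_blocking(compute, comm):.1f} ms, "
          f"async {step_time_async(compute, comm):.1f} ms")
```

Under this model the extra hop only matters once communication time exceeds the compute available to hide it behind, which these runs suggest we are nowhere near.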
The interesting thing here is the following histogram. It's the classic plot of cumulative wall-clock time against the number of particles updated in each step.
Normally we say that we get "killed" in the big steps. However, I think a more apt metaphor here is that we get "absolutely murdered" in the big steps: the final jump occurs when every particle in the system is updated and synchronised.
In terms of the tasks that each of the runs actually produces, I think this comparison movie is helpful: out
At the moment I do not understand why, with a single hop, we typically get more updates for our money. I'm confident it's not an artefact of the plot, because the little blob at 1 particle updated always stays in the same place. Unfortunately I didn't keep the output from the simulations, so I can't check whether they got "the same answer". Based on these histograms, I would guess not.
The following histogram shows the breakdown of the runtimes for each of the runs. Notice how similar they look.
In summary: based on these results, it would probably be a waste of time to consider the network topology in the redistribute/domain decomposition.
We can discuss this at the telecon.