The following plots are an attempt to understand why we scale less well now that we have multiple time-step sizes (multi-dt).
The test was a run of the master branch (at a4cb085a, 6/10/2016) on the EAGLE_25
volume (50 million particles) for 2000 steps, using the default run.sh script and
producing the usual timesteps_12.txt file.
First we show a plot of the number of particles updated per time step against
the time taken per time step (both axes are log'd).
The interesting thing is that we are linear down to ~1000 particles, at which
point we don't get much faster for fewer particles.
This matters because of the following plot:
This is a cumulative histogram of updated particles, so we see that steps of ~1000
particle updates or fewer account for half of all the time steps.
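The cumulative-histogram check above can be sketched as follows. This is a minimal illustration using synthetic stand-in data for the per-step updates column of timesteps_12.txt (the real file's column layout depends on the SWIFT version, so parsing is not shown):

```python
import random

def cumulative_step_fraction(updates):
    """Return sorted update counts and, for each, the fraction of steps
    that updated at most that many particles."""
    u = sorted(updates)
    n = len(u)
    frac = [(i + 1) / n for i in range(n)]
    return u, frac

# Synthetic stand-in for the updates column: a broad log-normal spread,
# roughly mimicking a multi-dt run where most steps touch few particles.
random.seed(42)
updates = [int(random.lognormvariate(7.0, 2.0)) + 1 for _ in range(2000)]

u, frac = cumulative_step_fraction(updates)

# Fraction of steps that updated <= 1000 particles (the claim in the text
# is that this is about half for the real run):
small_frac = sum(1 for x in updates if x <= 1000) / len(updates)
```

Plotting `frac` against `u` with a log x-axis reproduces the cumulative histogram described above.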
Richard and I initially speculated that this might best be explained by the number
of tasks, a proxy for cells, being roughly the same for these steps (it seemed that way
to the eye), so we could be processing the same number of cells for different numbers
of particles. Here is a plot of active tasks per step against updated particles:
That looks quite linear down to ~500 particles, where we lose good scaling; that is still 40%
of all steps, but we remain somewhat linear.
Next plot, same as above but now comparing the 12 core run with a 1 core run:
So we see the same effect. The scaling in the flat section is x2.
Hah, but this is the master branch, and the threadpool chunks tasks typically at
the 1000-10000 scale, so maybe that is the issue and we're not using all the
threads. No, not that either: pulling in the latest threadpool chunking scheme gives
no change (at all).
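To see why chunk sizes at the 1000-10000 scale could matter for small steps, here is a minimal chunked-threadpool sketch (not SWIFT's actual implementation, whose chunk-size policy is adaptive): workers atomically grab fixed-size chunks of the input, so when a step has fewer work items than one chunk, a single thread ends up doing everything.

```python
import threading

def threadpool_map(func, data, num_threads=4, chunk=64):
    """Simplified threadpool_map: workers grab fixed-size chunks of `data`
    via a shared cursor protected by a lock (standing in for an atomic)."""
    lock = threading.Lock()
    cursor = [0]
    results = [None] * len(data)

    def worker():
        while True:
            with lock:                     # atomic chunk grab
                start = cursor[0]
                if start >= len(data):
                    return
                cursor[0] = start + chunk
            for i in range(start, min(start + chunk, len(data))):
                results[i] = func(data[i])

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# With chunk=64 the 1000 items are spread over the threads; with
# chunk=10000 the first thread grabs all of them and the rest idle.
squares = threadpool_map(lambda x: x * x, list(range(1000)), num_threads=4, chunk=64)
serialised = threadpool_map(lambda x: x + 1, [1, 2, 3], num_threads=4, chunk=10000)
```

The result is always correct either way; the chunk size only changes how much parallelism is available for a given number of work items.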
However, the threadpool is having some effect. In the next plot we have added, in blue,
a 12 core run that only uses one threadpool thread. This converges to the
single core times.
Interestingly, some threadpool speed-up is still seen at ~100000 updates.
Digging deeper and breaking down the fraction of time spent during the drift and
engine launch for the runners, we get:
So we see that the work per update (equivalently, per task) behaves similarly for the runners
and the drift. It was pointed out that the time during engine launch is not just
for the runners, and includes some threadpool (and serial) parts as well, so the
flattening may not be due to the runners.
Next we show what happens to the timings if we change the cell_split_size value,
which determines the smallest cell particle count, from the default of 400 to 4000.
We seem to scale better (fewer cells to process), but are much slower overall (more
processing per cell).
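For reference, a parameter-file fragment for this experiment, assuming the usual SWIFT YAML layout in which cell_split_size lives in the Scheduler section:

```yaml
# Assumed SWIFT parameter-file fragment; the default value is 400.
Scheduler:
  cell_split_size: 4000   # stop splitting cells below this particle count
```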
A week later and we now have the mark tasks in drift and scheduler skip branches
merged, so here is the progress plot.
Clearly some very good progress.
The next plot shows what happens if we reduce the number of top-level cells from 30x30x30
to 16x16x16 (so up to 12384 particles per cell from 1878) by requiring SPH:max_smoothing_length: 1.0 (i.e. h_max=1):
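A quick back-of-the-envelope check on those per-cell counts for the EAGLE_25 volume (~50 million particles). These are mean particles per top-level cell; the figures quoted above (1878 and 12384) are presumably per-cell maxima, so they sit a little above the means:

```python
# Mean particles per top-level cell for the two grid choices.
n_part = 50_000_000
mean_30 = n_part / 30**3   # 30x30x30 top-level cells -> ~1852
mean_16 = n_part / 16**3   # 16x16x16 top-level cells -> ~12207
```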
That also looks like a good improvement. There seems to be an odd break at around 200 particles.
Note this is using h_max=1 on the drift+skip branch, not master.
On further investigation it seems that the 200 particle break occurs from around time
step 1400 through to the end of the run at step 2000, so it is probably a geometric issue,
like more active cells for the same number of particle updates. In the next plot we see
step plotted against time per step, with all the points to the left of the break highlighted
in magenta (the lower track) and some of those immediately to the right in green (the high region):
Looking for further clues, and wondering what a single thread result will look like,
we can check how the actual scaling affects the graphs by running with different numbers of cores.
Note these runs now use h_max=1. The horizontal line at 70ms are all the steps that
If we assume that these scale by the number of cores (which is somewhat untrue, since
the node used had turbo boost enabled, so runs on fewer cores will clock faster) we get,
for a select few:
And just to clear things up here are just the 1 and 12 core runs alone: