The following plots are an attempt to understand why we scale less well now that we have multiple time-steps (multi-dt).
The test was a run of the master branch (at a4cb085a, 6/10/2016) on the
volume (50 million particles) for 2000 steps, using the default
run.sh script and producing the usual output.
First we show a plot of the number of particles updated per time step against the time taken per time step (both axes are logarithmic).
The interesting thing is that we are linear down to ~1000 particles, at which point we don't get much faster for fewer particles.
This matters because of the following plot, a cumulative histogram of updated particles per step: steps with ~1000 or fewer particle updates account for half of all the time steps.
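The cumulative histogram can be built directly from the per-step update counts; a minimal sketch (the `updates` values here are made up for illustration, the real ones come from the run's step output):

```python
import numpy as np

# Cumulative fraction of time steps at or below each updated-particle
# count. 'updates' stands in for the real per-step counts from the run
# output; these values are made up for illustration.
updates = np.array([50, 200, 800, 1000, 5000, 200000])
counts = np.sort(updates)
cum_frac = np.arange(1, len(counts) + 1) / len(counts)

# Fraction of steps with <= 1000 updates:
frac_small = cum_frac[counts <= 1000][-1]
```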
Richard and I initially speculated that this might best be explained by the number of tasks, a proxy for cells, being roughly the same for these steps (it looked that way to the eye). So we could be processing the same number of cells for different numbers of particles. Here is a plot of active tasks per step against updated particles:
This looks quite linear down to ~500 particles, where we lose good scaling; that still accounts for 40% of all steps, but we remain somewhat linear below that point.
Next plot, same as above but now comparing the 12 core run with a 1 core run:
So we see the same effect; the speed-up from 1 to 12 cores in the flat section is only x2.
Hah, but this is the master branch, and its threadpool typically chunks tasks at the 1000-10000 scale, so maybe that is the issue and we're not using all the available threads.
No, not that either. Pulling in the latest threadpool chunking scheme gives no change (at all).
However, the threadpool is giving some effect. In the next plot we have, in blue, added a 12 core run that only uses one threadpool thread. This converges to the single core times.
Interesting that some threadpool speed up is seen at 100000 updates.
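One way to see why a threadpool speed-up only shows up at large update counts: with a coarse chunk size, small steps produce too few chunks to occupy all the threads. A rough sketch (the chunk size of 1000 is an assumption within the 1000-10000 range quoted above, not the actual value used):

```python
# Sketch of why coarse chunking can idle threads: work split into
# fixed-size chunks can only occupy as many threads as there are chunks.
# The chunk size is an illustrative assumption.

def n_busy_threads(n_items, chunk_size, n_threads):
    """Threads that actually receive a chunk of work."""
    n_chunks = -(-n_items // chunk_size)  # ceiling division
    return min(n_chunks, n_threads)
```

Under this model a ~1000-particle step with 1000-sized chunks keeps only one thread busy, while a 100000-particle step can feed all 12, consistent with where the threadpool speed-up appears.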
Digging deeper and breaking down the fraction of time spent during the drift and engine launch for the runners, we get:
So we see that the work per update (same as per task) behaves similarly for the runners
and the drift. It was pointed out that the time during engine launch is not just
the runners; it includes some threadpool (and serial) parts as well, so the
flattening may not be due to the runners.
Next we show what happens to the timings if we change the parameter
that determines the smallest cell particle count from the default of 400 to 4000.
We seem to scale better (fewer cells to process), but are much slower (more processing per interaction).
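The trade-off can be caricatured with a toy cost model: the number of cells falls linearly with the per-cell particle count, while the candidate pair interactions within a cell grow roughly quadratically. This ignores neighbour-search culling and is purely illustrative:

```python
# Toy model of the smallest-cell trade-off: bigger cells mean fewer cells
# to schedule, but more candidate particle pairs per cell (~per_cell**2).
# Purely illustrative; ignores neighbour-search culling.

def toy_costs(n_total, per_cell):
    """Return (cells to process, candidate pair work)."""
    n_cells = n_total / per_cell
    pair_work = n_cells * per_cell**2  # = n_total * per_cell
    return n_cells, pair_work

cells_400, work_400 = toy_costs(50e6, 400)
cells_4000, work_4000 = toy_costs(50e6, 4000)
```

In this caricature, 400 to 4000 gives 10x fewer cells to process but ~10x more candidate pair work, matching the "scale better but much slower" observation.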
A week later we now have the mark-tasks-in-drift and scheduler-skip branches merged, so here is the progress plot.
Clearly some very good progress.
The next plot shows what happens if we reduce the number of top-level cells from 30x30x30
to 16x16x16 (so up to 12384 particles per cell, from 1878) by requiring
SPH:max_smoothing_length: 1.0 (i.e. h_max=1).
That also looks like a good improvement, though there seems to be an odd break at around 200 particles.
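As a quick sanity check on the quoted per-cell counts (assuming the total particle count can be backed out as 1878 x 30^3):

```python
# Back out the total from the quoted 1878 particles per 30^3 top-level
# cell, then redistribute over 16^3 cells.
per_cell_30 = 1878
total = per_cell_30 * 30**3   # ~50.7 million particles
per_cell_16 = total / 16**3   # ~12380, close to the quoted 12384
```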
Note this is using
h_max=1 on the drift+skip branch, not master.
On further investigation it seems that the 200-particle break occurs from time step 1400 to the end of the run at step 2000, so it is probably a geometric issue, such as more active cells for the same number of particle updates. In the next plot we show step number against time per step, with all the points to the left of the break highlighted in magenta (the lower track) and some of those immediately to the right in green (the higher region):
Looking for further clues, and wondering what a single-thread result would look like, we can check how the actual scaling affects the graphs by running with different numbers of cores.
Note these runs now use
h_max=1. The horizontal line at 70ms is formed by all the steps that
If we assume that these scale with the number of cores (which is somewhat untrue, since the node used had turbo boost enabled, so runs on fewer cores will clock faster) we get, for a select few:
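The perfect-scaling comparison is just the 1-core time divided by the core count; a trivial sketch of the assumption (optimistic, as noted, because of turbo boost):

```python
# Perfect-scaling assumption: per-step time falls linearly with cores.
# Turbo boost on low core counts makes this optimistic in practice.

def ideal_step_time(t_1core, n_cores):
    """Per-step time if the run scaled perfectly with core count."""
    return t_1core / n_cores

# e.g. a 1.2 s single-core step would ideally take ~0.1 s on 12 cores.
```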
And just to clear things up here are just the 1 and 12 core runs alone: