
Drift on demand

Here is a suggestion to improve the scalability of the drifting process.

  • runner_do_drift() no longer does the drift itself, only the un-skipping of the tasks (and we rename the function accordingly). That should be really fast, as @jwillis' tests showed (provided we trust VTune...).
  • When we pick up a task (say a density-pair or a density-self), we first check whether its cells are at the current point in time. If not, we start by drifting the cell, or even better, only the relevant component: part, gpart or multipole, depending on the type of the task. We then mark the cell as being at the current time and carry on with the normal operations of that task (see the sketch after this list).
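Here is a minimal sketch of the second point. The struct layouts and helper names (ti_old_part, cell_drift_part, cell_ensure_drifted_part, ...) are placeholders standing in for whatever the real cell/engine fields and drift helpers end up being, not the actual API:

```c
#include <stdio.h>

typedef long long integertime_t;

struct engine { integertime_t ti_current; };

struct cell {
  integertime_t ti_old_part; /* Time the particles were last drifted to. */
  /* ... particle arrays, locks, etc. ... */
};

/* Placeholder for the actual particle drift of one cell. */
static void cell_drift_part(struct cell *c, const struct engine *e) {
  /* Move the particles of c forward to e->ti_current ... */
  (void)c; (void)e;
}

/* Drift the hydro particles of c on demand, i.e. only if the cell is not
 * already at the current point in time, then mark it as drifted. */
static void cell_ensure_drifted_part(struct cell *c, const struct engine *e) {
  if (c->ti_old_part == e->ti_current) return; /* Already up to date. */
  cell_drift_part(c, e);                       /* Do the actual drift. */
  c->ti_old_part = e->ti_current;              /* Mark cell as current. */
}

/* A density pair task would then start by drifting (only) what it needs. */
static void runner_dopair_density(struct cell *ci, struct cell *cj,
                                  const struct engine *e) {
  cell_ensure_drifted_part(ci, e); /* parts only; gparts/multipoles untouched. */
  cell_ensure_drifted_part(cj, e);
  /* ... normal density interaction loop over the pair ... */
}

int main(void) {
  struct engine e = {.ti_current = 42};
  struct cell ci = {.ti_old_part = 0}, cj = {.ti_old_part = 42};
  runner_dopair_density(&ci, &cj, &e); /* ci gets drifted, cj is skipped. */
  printf("ci at %lld, cj at %lld\n", ci.ti_old_part, cj.ti_old_part);
  return 0;
}
```

One thing the sketch glosses over: two tasks running concurrently may both see the same cell as not drifted, so the check-and-drift would presumably need to happen under the cell's lock (or use an atomic flag) so the drift is only done once.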

This pushes the drift into the very scalable part of the code and should help reduce the time spent in the overheads.

The advantage is this: if you only have one active cell (say), then most of its 26 neighbours will be handled by the same thread in the drifter threadpool, meaning we can't parallelise the drift very well. If, instead, the drift is part of the tasks, then this same operation is spread over all the threads that pick up a pair task. Plus, we reduce the bandwidth needed in runner_do_drift(); the tasks have to read the particle data anyway, so this won't change the memory needs there.

The only potentially problematic aspect is the handling of the rebuild check, and the fact that some tasks may randomly become more expensive than their naive cost estimate. But that should be OK, since these tasks will be at the start of the batch and we will have many more tasks in the graph to compensate.

Any thoughts?
