Atomic, i.e. non-locking, tasks
Just an idea to squeeze some more parallelism out of the smaller timesteps: Can we make variants of the density and force tasks that update the particle data atomically, and thus do not need to lock the underlying cells?
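To make the idea concrete, here is a minimal sketch of what an atomic floating-point accumulation could look like in C11. It assumes the accumulated field (e.g. a density sum) is stored as the 32-bit pattern of a `float` in an `_Atomic uint32_t`, and uses a compare-and-swap loop. The names `example_atomic_add_f` and `example_atomic_read_f` are made up for illustration and are not existing functions in the code base.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an existing function): atomically add `value`
 * to the float whose bit pattern is stored in `*addr`. */
static inline void example_atomic_add_f(_Atomic uint32_t *addr, float value) {
  /* Start from the currently stored bit pattern. */
  uint32_t old_bits = atomic_load_explicit(addr, memory_order_relaxed);
  uint32_t new_bits;
  do {
    /* Reinterpret the bits as a float without breaking aliasing rules. */
    float old_val;
    memcpy(&old_val, &old_bits, sizeof(float));
    const float new_val = old_val + value;
    memcpy(&new_bits, &new_val, sizeof(float));
    /* On failure, old_bits is refreshed with the current value and we retry. */
  } while (!atomic_compare_exchange_weak_explicit(
      addr, &old_bits, new_bits, memory_order_relaxed, memory_order_relaxed));
}

/* Hypothetical helper: read the accumulated value back as a float. */
static inline float example_atomic_read_f(_Atomic uint32_t *addr) {
  const uint32_t bits = atomic_load_explicit(addr, memory_order_relaxed);
  float val;
  memcpy(&val, &bits, sizeof(float));
  return val;
}
```

Relaxed ordering is used here on the assumption that the task dependencies already provide the necessary synchronisation between the accumulation phase and any later reads; that assumption would need checking against the actual scheduler.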
If we had such variants, we could add a flag to the scheduler that, when set, skips the cell locks for certain task types and calls the atomic task functions instead, e.g. when we have fewer than a certain number of active tasks.
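A rough sketch of how such a flag might be wired in is given below. Everything here is hypothetical: the struct, the enum, the threshold field and the function names are placeholders for illustration, not the scheduler's actual data structures.

```c
#include <stdbool.h>

/* Hypothetical task kinds; only density and force are assumed to have
 * atomic variants. */
enum example_task_kind {
  example_task_density,
  example_task_force,
  example_task_other
};

/* Hypothetical, stripped-down scheduler state. */
struct example_sched {
  int nr_active_tasks;  /* Number of tasks active in this timestep. */
  int atomic_threshold; /* Below this, switch to the atomic variants. */
  bool use_atomics;     /* The flag discussed above. */
};

/* Set the flag once per step, based on how many tasks are active. */
static void example_sched_update_mode(struct example_sched *s) {
  s->use_atomics = (s->nr_active_tasks < s->atomic_threshold);
}

/* Decide whether a task of the given kind still needs to lock its cells. */
static bool example_task_needs_lock(const struct example_sched *s,
                                    enum example_task_kind kind) {
  const bool has_atomic_variant =
      (kind == example_task_density || kind == example_task_force);
  return !(s->use_atomics && has_atomic_variant);
}
```

With this shape, the per-task decision is a cheap check, and the threshold could be tuned experimentally.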
While the atomics may be slower than regular writes, we would only be using them in cases where very few threads update very few particles, so contention, and hence the extra cost, should be small.
Another option would be, if we can implement atomic updates efficiently, to always use the atomic functions. This is effectively what Aidan does on the GPUs, where atomics are super-cheap.
Note that this doesn't save us when we have all interactions within a single sub-cell task, but we could think of creative ways of breaking these up should the need arise.