Significant refactoring of the way time-step sizes are exchanged.
All timestep_sync tasks now unlock a single top-level task.
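As a rough illustration of the dependency pattern (not the actual scheduler code), the sketch below shows several tasks unlocking one shared task through a wait counter. The names toy_task, wait, unlocks and toy_task_finish are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal task: not the real task structure. */
struct toy_task {
  int wait;                 /* Number of unfinished tasks this one waits on. */
  int done;                 /* Set to 1 once the task has run. */
  struct toy_task *unlocks; /* Task unlocked by this one (one only, for brevity). */
};

/* Mark a task as finished and release its dependent.
 * Returns 1 if the dependent task has just become runnable, 0 otherwise. */
int toy_task_finish(struct toy_task *t) {
  assert(t != NULL);
  t->done = 1;
  if (t->unlocks == NULL) return 0;
  t->unlocks->wait -= 1;
  return t->unlocks->wait == 0;
}
```

With three sync-like tasks pointing at the same top-level task, that task only becomes runnable once the last of the three has finished.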
engine_collect_end_of_step() now only loops (via threadpool) over the local top-level cells. No recursion any more.
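A minimal sketch of what a flat, non-recursive collection could look like, assuming each top-level cell already carries values aggregated over its sub-tree. The struct and field names here (top_cell, collected, ti_end_min, updated) and the function collect_chunk are hypothetical stand-ins, not the actual engine code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical top-level cell summary, assumed to be filled in before the
 * collection runs. */
struct top_cell {
  long long ti_end_min; /* Earliest end of time-step in this cell. */
  long long updated;    /* Number of particles updated in this cell. */
};

/* Hypothetical container for the collected values. */
struct collected {
  long long ti_end_min;
  long long updated;
};

/* Flat reduction over one chunk of top-level cells. Only the top-level
 * summaries are read; nothing is recursed into. In a threadpool setting,
 * each thread would handle its own chunk and the partial results would be
 * merged afterwards. */
struct collected collect_chunk(const struct top_cell *cells, size_t count) {
  assert(cells != NULL || count == 0);
  struct collected out = {LLONG_MAX, 0};
  for (size_t i = 0; i < count; ++i) {
    if (cells[i].ti_end_min < out.ti_end_min)
      out.ti_end_min = cells[i].ti_end_min;
    out.updated += cells[i].updated;
  }
  return out;
}
```

The cost of this loop scales with the number of local top-level cells rather than with the depth of the cell tree.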
The tend communication tasks that used to live at the super level are removed.
The engine_launch() call that was made every step to deal with the timestep limiter effect is removed, as this is now properly handled by the top-level task dependency.
This should help speed up the smallest steps by reducing the level of the plateau we usually see in the "main sequence" plots.