Draft: Refactoring of the time-step communication tasks
Significant refactoring of the way the time-step sizes are exchanged.
Summary:
- A new top-level task collects the time-step sizes from the super level to the top level. This was formerly done by making engine_collect_end_of_step() recurse.
- The timestep, timestep_limiter, and timestep_sync tasks all unlock that top-level task.
- engine_collect_end_of_step() now only loops (via threadpool) over the local top-level cells. No recursion any more.
- For each pair of top-level cells in the proxies we construct a pair of send/recv comm tasks.
- That comm task packs up the dt of the whole hierarchy, sends it, and unpacks the time-step sizes.
- The top-level time-step collection task unlocks the send.
- The individual per-species tend communication tasks that used to live at the super level are removed.
- The second call to engine_launch() done every step to deal with the timestep limiter effect is removed (as it is now properly dealt with by the top-level task dependency).
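To illustrate the non-recursive collection described in the list above, here is a minimal sketch in C. It assumes each top-level cell already holds the values gathered by the new collection task; the struct and function names (toy_top_cell, toy_step_summary, toy_collect_end_of_step) are hypothetical and are not the actual SWIFT types. The loop is serial here, whereas the real code distributes it over the threadpool.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-top-level-cell summary (NOT the real SWIFT struct cell).
 * Assumed to have been filled in by the top-level collection task. */
struct toy_top_cell {
  long long ti_end_min; /* Smallest time-step end in this cell's hierarchy. */
  long long updated;    /* Number of particles updated in this hierarchy. */
};

/* Hypothetical engine-wide result of the end-of-step collection. */
struct toy_step_summary {
  long long ti_end_min;
  long long updated;
};

/* Flat reduction over the local top-level cells: no recursion is needed
 * because every top-level cell already carries its own collected values. */
struct toy_step_summary toy_collect_end_of_step(
    const struct toy_top_cell *cells, size_t nr_cells) {
  assert(cells != NULL || nr_cells == 0);
  struct toy_step_summary s = {LLONG_MAX, 0};
  for (size_t i = 0; i < nr_cells; i++) {
    if (cells[i].ti_end_min < s.ti_end_min) s.ti_end_min = cells[i].ti_end_min;
    s.updated += cells[i].updated;
  }
  return s;
}
```

Calling toy_collect_end_of_step() on an array of top-level cells returns the smallest time-step end and the total update count in a single pass.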
This should help speed up the smallest steps by reducing the level of the plateau we usually see in the "main sequence" plots.
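The pack/send/unpack step of the new comm task could look roughly like the following sketch. It is an illustration under assumptions, not the code in this branch: toy_cell, toy_pack_timesteps() and toy_unpack_timesteps() are hypothetical names, only a single integer time-step end per cell is packed, and the MPI send/recv between the two calls is left out. It assumes the sender and the receiver hold trees of identical shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cell (NOT the real SWIFT struct cell). */
struct toy_cell {
  long long ti_end_min;        /* Time-step end of this cell. */
  struct toy_cell *progeny[8]; /* Children, NULL where absent. */
};

/* Depth-first pack of the time-step of a cell and all its progeny into a
 * flat buffer. Returns the number of values written. */
size_t toy_pack_timesteps(const struct toy_cell *c, long long *buf) {
  assert(c != NULL && buf != NULL);
  size_t count = 0;
  buf[count++] = c->ti_end_min;
  for (int k = 0; k < 8; k++)
    if (c->progeny[k] != NULL)
      count += toy_pack_timesteps(c->progeny[k], &buf[count]);
  return count;
}

/* Unpack in the same depth-first order into a tree of identical shape.
 * Returns the number of values read. */
size_t toy_unpack_timesteps(struct toy_cell *c, const long long *buf) {
  assert(c != NULL && buf != NULL);
  size_t count = 0;
  c->ti_end_min = buf[count++];
  for (int k = 0; k < 8; k++)
    if (c->progeny[k] != NULL)
      count += toy_unpack_timesteps(c->progeny[k], &buf[count]);
  return count;
}
```

Because the whole hierarchy travels in one buffer, a single message per top-level cell pair suffices, which is the point of replacing the per-species tend tasks at the super level.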
Activity
added MPI performance labels
requested review from @bvandenbroucke
assigned to @pdraper
added 1 commit
- d7da32d1 - Add a dependency between the feedback, force, and gravity loops and the new...
The queue on cosma 7 is slow today, so I probably won't have results until next week. But the results from the single node, 2 rank test already show a clear speed gain for the new communications. I still have 8 nodes, 16 ranks and 64 nodes, 128 ranks in the queue, where we would really see the difference.
No, those changes are also included in the reference branch. The two versions shown here are the branch fewer-timestep-comms (fd7aa81938557aaaef235e164475aad819e0973c) and master (5fdee87978c257e5dce5832aa00e84454c1e4ef8) in the FLAMINGO fork:
> git diff 5fdee87978c257e5dce5832aa00e84454c1e4ef8..fd7aa81938557aaaef235e164475aad819e0973c --stat
 src/cell.c                               |  13 +-
 src/cell.h                               |  60 +++----
 src/cell_pack.c                          | 229 ++----------------------
 src/cell_unskip.c                        |  68 -------
 src/engine.c                             |  32 +---
 src/engine_collect_end_of_step.c         | 296 +------------------------------
 src/engine_maketasks.c                   | 167 +++++++----------
 src/engine_marktasks.c                   |  80 ++-------
 src/engine_unskip.c                      |  67 ++-----
 src/power_spectrum.c                     |   6 +-
 src/runner.h                             |   1 +
 src/runner_main.c                        |  25 +--
 src/runner_time_integration.c            |  76 ++++++++
 src/scheduler.c                          |  50 +-----
 src/scheduler.h                          |  22 ++-
 src/space_recycle.c                      |   1 +
 src/task.c                               |   8 +-
 src/task.h                               |   7 +-
 tools/task_plots/swift_hardcoded_data.py |   7 +-
 19 files changed, 264 insertions(+), 951 deletions(-)
As promised: task plots.
Unfortunately, the time steps diverge quite quickly. Step 16 should be the same for all runs. The big step in the early 40s should also be comparable, and there the difference is quite clear (e.g. FLAMINGO 400 step 44 master vs step 46 this branch).
Are these really correct? The collect task is dominating? E.g. https://home.strw.leidenuniv.nl/~vandenbroucke/flamingo/flamingo_400_fewer_comms/step3r0.html#all
That looks better indeed. Shame it does not translate into speed improvements. Weird that this newer version is somewhat slower in the non-MPI case.
Edited by Matthieu Schaller
Right. Makes sense.
I think we can do even better to eliminate the cloud of pink at the end here: https://home.strw.leidenuniv.nl/~vandenbroucke/flamingo/flamingo_400_fewer_comms_new/step42r.html
Though that may not help overall.
mentioned in merge request !1457 (closed)