Draft: Refactoring of the time-step communication tasks

assigned to @pdraper

changed title from {-Only active the ti_end comms for part and gpart if we are not running with...-} to Refactoring of the time-step communication tasks

changed the description

added 2 commits

bcf49ce8 - Also set the cell->top pointer in the foreign cells
6ea4314c - Add a dependency between the feedback, force, and gravity loops and the new...

added 1 commit

d7da32d1 - Add a dependency between the feedback, force, and gravity loops and the new...

This passed all my checks and is marginally faster. I have not yet tried it on a large number of nodes, where one might expect a larger improvement over master.

Right, I will start a round of FLAMINGO tests with this.

There may be scope to be cleverer about the unskipping, and possibly to distribute the unpacking over multiple tasks if these show up as a problem in the graphs, but a priori I would not expect that.

The queue on cosma 7 is slow today, so I probably won't have results until next week. But the results from the single node, 2 rank test already show a clear speed gain for the new communications. I still have 8 nodes, 16 ranks and 64 nodes, 128 ranks in the queue, where we would really see the difference.

There is a significant speedup for the larger FLAMINGO boxes:

I'm still producing the task plots and will post them here when they are ready.

Just to be extra cautious, this is not because of the vsig change in the new branch vs. the old?

No, those changes are also included in the reference branch. The two versions shown here are the branch fewer-timestep-comms (fd7aa81938557aaaef235e164475aad819e0973c) and master (5fdee87978c257e5dce5832aa00e84454c1e4ef8) in the FLAMINGO fork:

> git diff 5fdee87978c257e5dce5832aa00e84454c1e4ef8..fd7aa81938557aaaef235e164475aad819e0973c --stat
 src/cell.c                               |  13 +-
 src/cell.h                               |  60 +++----
 src/cell_pack.c                          | 229 ++----------------------
 src/cell_unskip.c                        |  68 -------
 src/engine.c                             |  32 +---
 src/engine_collect_end_of_step.c         | 296 +------------------------------
 src/engine_maketasks.c                   | 167 +++++++----------
 src/engine_marktasks.c                   |  80 ++-------
 src/engine_unskip.c                      |  67 ++-----
 src/power_spectrum.c                     |   6 +-
 src/runner.h                             |   1 +
 src/runner_main.c                        |  25 +--
 src/runner_time_integration.c            |  76 ++++++++
 src/scheduler.c                          |  50 +-----
 src/scheduler.h                          |  22 ++-
 src/space_recycle.c                      |   1 +
 src/task.c                               |   8 +-
 src/task.h                               |   7 +-
 tools/task_plots/swift_hardcoded_data.py |   7 +-
 19 files changed, 264 insertions(+), 951 deletions(-)

Excellent. So I can be extra happy.

As promised: task plots.

Unfortunately, the time steps diverge quite quickly. Step 16 should be the same for all runs. The big step in the early 40s should also be comparable, and there the difference is quite clear (e.g. FLAMINGO 400 step 44 master vs step 46 this branch).

Are these really correct? Is the collect task really dominating, e.g. in https://home.strw.leidenuniv.nl/~vandenbroucke/flamingo/flamingo_400_fewer_comms/step3r0.html#all ?

Or is it a matter of me messing up the plotting labels?

All of these use the label files generated by SWIFT, so the labels are correct.

This is also a very short step.

So I don't know whether we should expect 1974 collect tasks to be faster than 1 hydro update and a few tens of gravity calculations.

OK, then that means we can improve things further. I can maybe be cleverer about which top-level cells need to run the timestep task.

Here are the dead time (and step size) plots for these runs.

Master branch:

This branch:

And a direct comparison of both branches for the FLAMINGO 400 run (16 ranks):

So this is clearly having an impact where it hurts in the big runs.
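For readers unfamiliar with the metric, here is a minimal sketch of how a dead time fraction could be computed from per-thread task timings. It assumes that "dead time" means the share of available thread time in which no task is running; the function name, the input layout, and the numbers are hypothetical and are not taken from SWIFT's plotting scripts.

```python
def dead_time_fraction(tasks_per_thread, step_start, step_end):
    """Fraction of available thread time in which no task was running.

    tasks_per_thread: one list per thread of (tic, toc) task intervals.
    Assumption: tasks on the same thread never overlap.
    """
    n_threads = len(tasks_per_thread)
    available = n_threads * (step_end - step_start)
    if available <= 0:
        raise ValueError("need at least one thread and a positive step length")
    busy = 0.0
    for intervals in tasks_per_thread:
        # Tasks on one thread run one after another, so durations add up.
        busy += sum(toc - tic for tic, toc in intervals)
    return (available - busy) / available


# Hypothetical step of length 10 (arbitrary units) on two threads.
example = [
    [(0.0, 4.0), (5.0, 9.0)],  # thread 0: busy for 8 units
    [(0.0, 2.0)],              # thread 1: busy for 2 units
]
fraction = dead_time_fraction(example, 0.0, 10.0)
print(f"dead time: {fraction:.0%}")
```

In this toy example half of the available thread time is idle. On short steps with few active particles the idle share tends to be dominated by threads waiting on communication, which is where the comparison between the two branches matters most.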

added 3 commits

2a44ba8a - Do not activate all the local dt-collection tasks. That's too much
e782183a - Activate the dt-collection task for any top-level where a cell is active and...
02eddedb - When activating the timestep-sync or the timestep-limiter task, also activate...

That new version might be performing even better on the shortest steps.
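The idea behind these commits can be sketched roughly as follows: instead of activating a dt-collection task for every local top-level cell, activate it only for top-level cells that contain an active cell. This is a toy model rather than the SWIFT implementation. The `Cell` class, the `ti_end_min == ti_current` activity test, and all function names are assumptions made for illustration, and the additional condition that is cut off in the commit message is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    ti_end_min: int  # earliest end-of-step time of anything in this cell
    progeny: list = field(default_factory=list)


def has_active_cell(cell, ti_current):
    """True if this cell or any of its progeny ends its step at ti_current."""
    if cell.ti_end_min == ti_current:
        return True
    return any(has_active_cell(p, ti_current) for p in cell.progeny)


def select_collect_tasks(top_cells, ti_current):
    """Indices of top-level cells whose collection task should be activated."""
    return [i for i, c in enumerate(top_cells) if has_active_cell(c, ti_current)]


# Hypothetical short step: only one of three top-level cells has work to do.
top_cells = [
    Cell(16, [Cell(16), Cell(16)]),
    Cell(8, [Cell(8), Cell(16)]),
    Cell(32),
]
active = select_collect_tasks(top_cells, ti_current=8)
print(f"activating {len(active)} of {len(top_cells)} collection tasks: {active}")
```

On a very short step most top-level cells are inactive, so a filter of this kind would skip the bulk of the collection tasks that dominated the earlier task plots.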

closed

reopened

Latest version:

Step size and dead time for FLAMINGO 400 run:

Task plots:

The black has reduced a lot in the small steps and the dead time has gone down a bit in those steps as well.

That looks better indeed. Shame it does not translate into speed improvements. Weird that this newer version is somewhat slower in the non-MPI case.

That could just be because of stochastic changes in the time stepping. I have seen similar jumps in the past.

Right. Makes sense.

I think we can do even better to eliminate the cloud of pink at the end here: https://home.strw.leidenuniv.nl/~vandenbroucke/flamingo/flamingo_400_fewer_comms_new/step42r.html

Though that may not help overall.

mentioned in merge request !1457 (closed)
