Only activate the tend comms that are needed
In large simulations, especially with small time-steps, we swamp the system with tend
communications at the end of a step.
That is because the current logic has no good way of deciding which ones to launch (because of the complexities of the sync + limiter), so we activate all of them. That can be N^3 * 125 communications, where N is the number of top-level cells along one side. For a 32^3 setup like the second-largest COLIBRE runs, that is 32^3 * 125 = 4,096,000, i.e. about 4e6 comms.
Here, we improve upon this by doing the following:
- Construct an array of booleans (stored as char) with one entry per top-level cell.
- When running at the top level, the timestep_collect, sync, and limiter tasks set the boolean to 'true' if they changed anything related to the time-step in this cell.
- We then all-reduce the array across all nodes.
- Each node then activates the tend comms involved in local cells for which the boolean is true.
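The steps above could be sketched roughly as follows. This is a minimal serial illustration, not the code in this change: all function and variable names are hypothetical, and the all-reduce is replaced by a plain loop over per-rank arrays so that the logic can be read without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names; none of these are SWIFT functions. */

/* Called by a top-level timestep_collect, sync, or limiter task when it
 * changed something time-step related in top-level cell `cid`. */
void flag_cell_updated(char *flags, size_t n_cells, size_t cid) {
  assert(cid < n_cells);
  flags[cid] = 1;
}

/* Serial stand-in for the all-reduce: element-wise logical OR of the flag
 * arrays of all ranks. `rank_flags` holds n_ranks arrays of n_cells entries
 * stored back to back. A real run would use an MPI collective instead. */
void reduce_flags(char *global, const char *rank_flags, size_t n_ranks,
                  size_t n_cells) {
  for (size_t c = 0; c < n_cells; c++) {
    global[c] = 0;
    for (size_t r = 0; r < n_ranks; r++)
      if (rank_flags[r * n_cells + c]) global[c] = 1;
  }
}

/* Count the local top-level cells whose tend comms would be activated,
 * i.e. cells that are local to this rank and flagged after the reduction. */
size_t count_cells_to_activate(const char *global, const char *is_local,
                               size_t n_cells) {
  size_t count = 0;
  for (size_t c = 0; c < n_cells; c++)
    if (is_local[c] && global[c]) count++;
  return count;
}
```

After the reduction every rank holds the same flag array, so each rank can decide locally which tend comms to activate without any further exchange.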
This means we trade a lot of communications for a global reduction bottleneck.
On a COLIBRE L100N1504 run on 20 nodes (80 ranks, 32^3 TLCs), we see a 5-10% speed-up. In particular, all the steps involving very few particles are significantly faster (from 500+ ms down to 200 ms). The impact will be larger at even higher resolution.