In large simulations, especially those with small time-steps, we swamp the system with tend communications at the end of each step.
That is because the current logic has no good way of deciding which ones to launch (because of the complexities of the sync + limiter), so we decided to activate all of them. That can be N^3 * 125 communications, where N is the number of top-level cells on one side. That's about 4e6 comms in a 32^3 setup like the second-largest COLIBRE runs.
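
As a quick sanity check of that estimate, the count can be computed directly. This is only a minimal sketch: the function name `tend_comm_count` and its parameters are made up for illustration and are not part of the code base, and the factor of 125 is taken from the text as-is.

```python
def tend_comm_count(n_side: int, per_cell: int = 125) -> int:
    """Upper bound on tend communications when all of them are activated.

    n_side   -- number of top-level cells along one side (N in the text)
    per_cell -- communications per top-level cell (125, as quoted in the text)
    """
    return n_side ** 3 * per_cell


count = tend_comm_count(32)
print(f"{count:,} comms (~{count:.1e})")  # 4,096,000 comms (~4.1e+06)
```

For the 32^3 setup this gives 4,096,000, consistent with the 4e6 figure quoted above.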
Here, we improve upon this by doing the following:
This means we trade a lot of communications for a global reduction bottleneck.
On a COLIBRE L100N1504 run on 20 nodes (80 ranks, 32^3 TLCs), we see a 5-10% speed-up. In particular, all the steps involving very few particles are significantly faster (from 500+ ms down to 200 ms). The impact will be larger at even higher resolution.