Avoid very unbalanced memory loads
Some of the most commonly used repartitioning schemes (e.g. fullcosts) do not directly consider the memory load per rank. They do consider it indirectly, since the task load should scale with the number of particles and hence with the memory load. But if this indirect link somehow breaks down, which seems to be possible (see the background to #801 (closed)), then a repartitioning can lead to severe memory imbalances that in the worst case produce restart files so unbalanced that the run can no longer be restarted on the same number of ranks.
It would be good to have a contingency plan for when this happens. A few ideas were discussed during the telecon today:
- Check the memory balance of a proposed new partitioning and reject it if it is too unbalanced. We can then either try again and hope the next proposed partitioning is more balanced, skip the repartitioning and keep the old partitioning, or use a combination of both (e.g. try twice, then give up and keep the old one). A rough sketch of such a check is given after this list.
- Check the balance of the restart dumps that are created. If those become too unbalanced, we know something strange is happening with the system and can try to take appropriate measures.
- Play it safe and abort the run if anything weird happens in the repartitioning or while writing the restart files.
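For the first idea, a minimal sketch of what such a rejection check could look like (the names `partition_memory_ok` and `MAX_MEMORY_IMBALANCE`, and the way the per-rank memory estimate is obtained, are all hypothetical and not existing code):

```c
#include <mpi.h>
#include <stddef.h>

/* Hypothetical threshold: reject a proposed partition if the most loaded
 * rank would hold more than 1.5x the mean memory load. */
#define MAX_MEMORY_IMBALANCE 1.5

/* Returns 1 if the proposed partition is acceptable, 0 if it should be
 * rejected. `local_bytes` is the memory this rank would hold under the
 * proposed partition (how that is estimated is left open here). */
int partition_memory_ok(size_t local_bytes, MPI_Comm comm) {
  int nr_ranks;
  MPI_Comm_size(comm, &nr_ranks);

  /* Gather the maximum and total memory load over all ranks. */
  double local = (double)local_bytes;
  double max_bytes, sum_bytes;
  MPI_Allreduce(&local, &max_bytes, 1, MPI_DOUBLE, MPI_MAX, comm);
  MPI_Allreduce(&local, &sum_bytes, 1, MPI_DOUBLE, MPI_SUM, comm);

  const double mean_bytes = sum_bytes / nr_ranks;
  return max_bytes <= MAX_MEMORY_IMBALANCE * mean_bytes;
}
```

On rejection, the caller could then implement whichever policy we settle on: retry the partitioning, keep the old partition, or a mix of the two (e.g. retry twice, then keep the old one).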
I think the bottom line of the discussion was that we need to avoid a scenario where the run can no longer be restarted. So either we need to prevent memory imbalances somehow, or we need to make sure that the run aborts without overwriting its last usable restart dumps. Aborting would generally be the safest bet, since manual intervention is usually a lot smarter than any automatic balancing scheme we can come up with.
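On the point of never overwriting the last usable dump, one possible pattern (just a sketch; `write_restart_data` is a hypothetical stand-in for the actual per-rank dump routine) is to write to a temporary name and only rename it over the previous dump once the write has completed:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the actual per-rank dump routine; returns 0 on
 * success. Here it just writes a placeholder so the sketch is self-contained. */
static int write_restart_data(const char *filename) {
  FILE *f = fopen(filename, "wb");
  if (f == NULL) return 1;
  const char payload[] = "restart data would go here";
  fwrite(payload, 1, sizeof(payload), f);
  return fclose(f);
}

/* Write the dump under a temporary name first, so that an aborted or failed
 * write never destroys the last usable restart file. */
void safe_restart_dump(const char *filename) {
  char tmpname[256];
  snprintf(tmpname, sizeof(tmpname), "%s.tmp", filename);

  if (write_restart_data(tmpname) != 0) {
    fprintf(stderr, "Restart dump failed, keeping previous dump intact.\n");
    abort();
  }

  /* rename() replaces the old file atomically on POSIX file systems. */
  if (rename(tmpname, filename) != 0) {
    fprintf(stderr, "Could not move restart dump into place.\n");
    abort();
  }
}
```

Combined with a balance check before the rename, this would let us abort on anything suspicious while still leaving a restartable set of files on disk.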