Repartition parameter changes ignored upon restart
Background: one of the FLAMINGO runs somehow ended up being terribly imbalanced in memory, to the point that some ranks contained 100x more particles than others. Unfortunately, two generations of restart files were written between the last repartition and the point where it ran out of memory and crashed, so that we were left with two sets of imbalanced restart files that inevitably crash during the next largish step upon restart. While we ended up successfully resuming the run by doubling the number of nodes, I am trying to figure out ways to recover usable restart files on the original number of nodes, in case this ever happens again for a run where we can't just double the number of nodes.
Turns out that you cannot actually change the parameters that govern the repartition upon restart. While main.c does call partition_init() and even writes a message informing the user of the chosen repartition scheme (as read from the parameter file), engine_struct_restore() later allocates a new repartition struct that is restored from the restart files, and hence contains the original parameters.
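The order of operations described above can be illustrated with a minimal, self-contained sketch. All names here (`repartition_sketch`, `engine_sketch`, `sketch_init_from_params`, `sketch_engine_restore`) and the scheme labels are hypothetical stand-ins, not the actual SWIFT structs or functions; the sketch only assumes that the restore step allocates a fresh struct and fills it from the saved state.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the repartition settings (not the real struct). */
struct repartition_sketch {
  char type[32];
  float trigger;
};

/* Hypothetical stand-in for the engine, which only holds a pointer. */
struct engine_sketch {
  struct repartition_sketch *reparttype;
};

/* Plays the role of partition_init(): fill the struct from values that
 * were read from the parameter file. */
void sketch_init_from_params(struct repartition_sketch *r, const char *type,
                             float trigger) {
  snprintf(r->type, sizeof(r->type), "%s", type);
  r->trigger = trigger;
}

/* Plays the role of the restore step: allocate a NEW struct, copy the
 * saved state into it and repoint the engine. Whatever the engine
 * pointed at before (the freshly initialised settings) is no longer used. */
void sketch_engine_restore(struct engine_sketch *e,
                           const struct repartition_sketch *saved) {
  struct repartition_sketch *r = malloc(sizeof(*r));
  if (r == NULL) {
    fprintf(stderr, "allocation failed\n");
    exit(EXIT_FAILURE);
  }
  *r = *saved;
  e->reparttype = r;
}
```

If the engine is first pointed at settings built with `sketch_init_from_params()` and `sketch_engine_restore()` is called afterwards, `e.reparttype` ends up holding the saved values, while any message printed before the restore still shows the values from the parameter file.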
This is unfortunate, since it means there is no way to change the repartition scheme upon restart. It is also misleading, since the output makes it seem as if this is possible. Worse, the repartition scheme reported in the output can differ from the one that is actually used, which complicates debugging.
Ideally, it should be possible to change the repartition parameters upon restart. If this turns out to be too complex, then at least we should make sure the standard output represents the actual repartition scheme that is used.
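One possible shape for such a fix is sketched below, under stated assumptions: after the struct has been restored, the user-facing parameters are overwritten with the values from the current parameter file, and the message is generated from the struct that will actually be used. The struct layout, the field names and both functions (`apply_param_overrides`, `describe_scheme`) are hypothetical; in particular, the sketch assumes the struct also carries some internal state that must survive the restore, which is why it copies field by field rather than replacing the whole struct.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical settings struct: two user-facing parameters plus one
 * piece of internal state that should be kept from the restart file. */
struct repart_settings {
  char type[32];        /* user-facing: scheme label */
  float trigger;        /* user-facing: repartition trigger */
  int steps_since_last; /* internal state (assumed), not a parameter */
};

/* Overwrite only the user-facing parameters of the restored struct with
 * the values read from the current parameter file. */
void apply_param_overrides(struct repart_settings *restored,
                           const struct repart_settings *from_params) {
  memcpy(restored->type, from_params->type, sizeof(restored->type));
  restored->trigger = from_params->trigger;
  /* steps_since_last is deliberately left as restored. */
}

/* Format a description of the scheme that is actually in use. Returns
 * the value of snprintf(), i.e. the length the full message requires. */
int describe_scheme(const struct repart_settings *r, char *buf, size_t len) {
  return snprintf(buf, len, "repartition type: %s, trigger: %g", r->type,
                  (double)r->trigger);
}
```

Calling `describe_scheme()` only after the restore (and after any overrides) would cover the fallback case as well: even if no overrides are applied, the output then reflects the restored settings rather than the ones in the parameter file.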