Code exits during snapshot writing over MPI
My runs of 50 Mpc EAGLE boxes over MPI (2 ranks, 28 threads each) are consistently crashing when writing a snapshot. This is with code version e51bb143, and standard configuration options (tbbmalloc, ipo, SPHENIX hydro, EAGLE subgrid, Wendland-C2 kernel).
This is a development code version that can write multiple levels of outputs (here two: full snapshots, and snipshots containing only black hole data). However, this feature works fine without MPI, and the crash primarily happens when writing a full snapshot. The simulation has reached z ~ 15 at this point and has already written 12 snapshots (both full and reduced); no black holes have formed yet. So I am hesitant to blame this on the snipshot aspect.
When it crashes, the code does not write any entry to the error log file; there is just a short system message in the standard output log saying "Bad termination of one of your application processes, Exit Code: 9" (this is on Cosma7).
The crash is fully reproducible when restarting the simulation from the last-written restart file (dumped around 1 hr before the crash). However, when changing the restart dump frequency to write a restart file closer to the crash point, and then restarting from that new restart file, it writes the snapshot on which it crashed originally without a problem -- only to then crash at another snapshot, again after roughly 1 hr of run time.
An example (on Cosma) is in directory /cosma7/data/dp004/dc-bahe1/EXL/ID162F/ (which started from what are now the *.prev restart files, which were copied from the full run at /cosma7/data/dp004/dc-bahe1/EXL/ID162_E50_C4/). The continuation run, started from the last restart files in .../ID162F/restart/, and crashing two snapshots further, is in /cosma7/data/dp004/dc-bahe1/EXL/ID162F_restart/.