Add restart facility
We need to be able to restart SWIFT for running on time-limited queues and to deal with machine downtimes.
The basics of this approach are that:
- the engine struct references all the data that we need
- we will only allow restarts on the same architecture with the same numbers of MPI ranks, threads and parameters
- information that can be reconstructed should not be saved (cells/tasks).
To support this we now have various dump/restore functions that manage their structs as necessary (these are the structs that are allocated, rather than statically referenced). It is assumed that specialist knowledge of each struct is held in these functions. They are all called, recursively as necessary, to write a binary stream for each rank.
This is fragile in that the order of the dump sequence cannot be changed, and new struct members or struct reorganization will break the restart. The necessary work to recover and restart looks a little piecemeal, i.e. it works but has not been designed as a whole.
The dump format is a byte stream with sections for each struct. These are started with a header block consisting of the length of the data and a label (20 chars) followed by the data itself. The lengths of the data are checked against the sizes of the current structs, but the labels are just symbolic and may be used in the future.
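A minimal sketch of such a section writer and reader is given here to make the layout concrete. The struct and function names (`block_header`, `block_write`, `block_read`) and the use of `size_t` for the length are assumptions made for illustration, not the names or types used in the branch; only the 20-character label and the length check come from the description above.

```c
#include <stdio.h>
#include <string.h>

#define LABEL_SIZE 20 /* Label length as stated in the description. */

/* Hypothetical header layout: payload length followed by a fixed-size label. */
struct block_header {
  size_t len;
  char label[LABEL_SIZE];
};

/* Write one section: the header first, then the raw bytes of the struct.
 * Returns 0 on success, -1 on a short write. */
int block_write(FILE *stream, const void *data, size_t len,
                const char *label) {
  struct block_header head;
  memset(&head, 0, sizeof(head)); /* Zero padding and unused label bytes. */
  head.len = len;
  strncpy(head.label, label, LABEL_SIZE - 1); /* Always NUL terminated. */
  if (fwrite(&head, sizeof(head), 1, stream) != 1) return -1;
  if (len > 0 && fwrite(data, len, 1, stream) != 1) return -1;
  return 0;
}

/* Read one section back. The stored length must match the size that the
 * current build expects for the struct, otherwise the restart is rejected.
 * Returns 0 on success, -1 on a read error, -2 on a size mismatch. */
int block_read(FILE *stream, void *data, size_t expected_len,
               char *label_out) {
  struct block_header head;
  if (fread(&head, sizeof(head), 1, stream) != 1) return -1;
  if (head.len != expected_len) return -2;
  if (label_out != NULL) memcpy(label_out, head.label, LABEL_SIZE);
  if (expected_len > 0 && fread(data, expected_len, 1, stream) != 1) return -1;
  return 0;
}
```

A caller would pass `sizeof(struct ...)` as `expected_len`, so a struct that has gained or lost members since the dump is caught by the length check. Because raw bytes are written, a file produced this way is only readable on the same architecture, which matches the restriction listed above.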
Control of restart files is provided by the parameters:
Restarts:
  enable: 1          # (Optional) whether to enable dumping restarts at fixed intervals.
  onexit: 0          # (Optional) whether to dump restarts on exit (*needs enable*)
  subdir: restart    # (Optional) name of subdirectory for restart files.
  basename: swift    # (Optional) prefix used in naming restart files.
  delta_hours: 6.0   # (Optional) decimal hours between dumps of restart files.
  stop_steps: 100    # (Optional) how many steps to process before checking if the <subdir>/stop file exists.
                     # When present the application will attempt to exit early, dumping restart files first.
Restarting is activated by the flag -r (most command-line options are ignored, but the parameter file is still required, as it says where to find the restart files). Output files are appended to on restart; this will not always make sense, as the restart point can be some time before the time when the application exited.
SWIFT can now be stopped by touching the file stop in the restarts directory. This will dump the restart files, regardless of the onexit setting, although if enable is false, that will be respected. The presence of this file is only checked every stop_steps steps (currently every 100).
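The polling described above could look roughly like the following sketch. The function name `restart_stop_requested` and its arguments are hypothetical; the only assumptions taken from the description are the file name `stop`, its location in the restart subdirectory, and the check being made once every `stop_steps` steps.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if this is a polling step and "<subdir>/stop" exists, else 0.
 * subdir and stop_steps mirror the Restarts:subdir and Restarts:stop_steps
 * parameters; the function itself is a hypothetical illustration. */
int restart_stop_requested(const char *subdir, int step, int stop_steps) {
  /* Only touch the file system once every stop_steps steps. */
  if (stop_steps <= 0 || step % stop_steps != 0) return 0;

  char path[4096];
  int n = snprintf(path, sizeof(path), "%s/stop", subdir);
  if (n < 0 || (size_t)n >= sizeof(path)) return 0; /* Path too long. */

  /* stat() succeeds only if the file exists and is reachable. */
  struct stat buf;
  return stat(path, &buf) == 0;
}
```

In a main loop this would be called once per step, for example `if (restart_stop_requested("restart", step, 100)) { /* dump restart files and exit */ }`. Under MPI the ranks would also need to agree on the result before exiting, which this sketch does not cover.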
Jobs to do:
- keep backup copies of the restart files
- decide how to poll the stop file.
Activity
Currently failing on some restarts with self-gravity and stars on the next repartition. Command:
mpirun -np 4 ../swift_mpi -G -S -s -t 8 -n 32 -PDomainDecomposition:trigger:5 eagle_6.yml
followed by:
mpirun -np 4 ../swift_mpi -G -S -s -t 8 -n 64 -PDomainDecomposition:trigger:5 eagle_6.yml -r
added 1 commit
- a9b7c600 - Need to make sure xparts are not freed when not used as well
added 1 commit
- 271c51d3 - We cannot derive the estimated tasks per cell on restart so save the old value and reuse that
added 1 commit
- 9a6d9757 - Remove debugging output, tidy up output and make sure we return restart files in…
added 1 commit
- cfa0968d - Sugar of various kinds, new parameters to control restart file dumping, better…
I will have a more complete look over the weekend but it looks like it is doing the job.
From the space, you only dump the part/xpart/gpart/spart arrays. Is that enough?
To stop the application, many codes I have used scan the directory where they are running and if they spot an (empty) file called "stop" they then proceed with writing a check-point and exit.
added 1 commit
- dc0aaaa6 - Add check for <Restarts:subdir>/stop file every no. of steps.
added 1 commit
- 1ba3366a - Store the length of the packet written into the stream and add a label
added 1 commit
- 0ae5c4b3 - Allow parameters controlling restarts to be overridden by the runtime parameters
added 6 commits
- 0ae5c4b3...efd4db84 - 5 commits from branch master
- 93542fc8 - Merge remote-tracking branch 'origin/master' into restart-structs
added 1 commit
- 45945643 - Save the snapshot counter so we restart correctly
I think this is good enough for someone else to test now, so re-assigning to @matthieu.
assigned to @matthieu
added 1 commit
- 464b2120 - Add end marker to restart files, should make it possible to check for truncation
I don't know of any asynchronous calls to stat()-like functions, so the only way to do that would be to use a thread to do the polling (I guess you could make that lightweight by checking the content of the file, rather than its existence, that could probably be based on a select()). Sounds like a lot of effort.
OK, that makes sense, we want to mainline this code as soon as possible.
@matthieu go ahead and merge this. Please leave the branch for updates.
mentioned in issue #399 (closed)