We need to be able to restart SWIFT so that it can run on time-limited queues and recover from machine downtime.
The basics of this approach are that:
- the engine struct references all the data that we need
- we will only allow restarts on the same architecture with the same numbers of MPI ranks, threads and parameters
- information that can be reconstructed should not be saved (cells/tasks).
To support this we now have various dump/restore functions that manage their structs as necessary (these are the structs that are allocated, rather than statically referenced). Specialist knowledge of each struct is assumed to be held in these functions. They are all called, recursively as necessary, to write a binary stream for each rank.
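As an illustration of what "managing" an allocated member involves, here is a minimal sketch. The struct `parts` and the functions `parts_dump` and `parts_restore` are hypothetical names made up for this example, not the actual SWIFT functions. The point is only that a raw copy of the struct carries a stale pointer, so the restore side has to reallocate and refill the array.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical struct that owns an allocated array. */
struct parts {
  size_t count;
  double *mass; /* allocated: the pointer value is meaningless after a restart */
};

/* Write the struct itself, then the array it points to. Returns 0 on success. */
int parts_dump(const struct parts *p, FILE *stream) {
  assert(p != NULL && stream != NULL);
  if (fwrite(p, sizeof(*p), 1, stream) != 1) return -1;
  if (p->count > 0 &&
      fwrite(p->mass, sizeof(double), p->count, stream) != p->count)
    return -1;
  return 0;
}

/* Read the struct back, discard the stale pointer, reallocate and refill. */
int parts_restore(struct parts *p, FILE *stream) {
  assert(p != NULL && stream != NULL);
  if (fread(p, sizeof(*p), 1, stream) != 1) return -1;
  p->mass = NULL; /* the dumped pointer refers to the old process */
  if (p->count == 0) return 0;
  p->mass = malloc(p->count * sizeof(double));
  if (p->mass == NULL) return -1;
  if (fread(p->mass, sizeof(double), p->count, stream) != p->count) {
    free(p->mass);
    p->mass = NULL;
    return -1;
  }
  return 0;
}
```

A struct that contains further allocated structs would call their dump/restore functions in the same fixed order from inside its own, which is where the recursion comes from.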
This is fragile in that the order of the dump sequence cannot be changed, and new struct members or struct reorganization will break the restart. The work necessary to recover and restart looks a little piecemeal, i.e. it works but was not designed.
The dump format is a byte stream with sections for each struct. These are started with a header block consisting of the length of the data and a label (20 chars) followed by the data itself. The lengths of the data are checked against the sizes of the current structs, but the labels are just symbolic and may be used in the future.
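The section layout described above can be sketched as follows. This is an assumption-laden illustration, not the real implementation: `struct section_header`, `dump_section` and `restore_section` are hypothetical names, and the exact on-disk layout of the header (field order, padding) is guessed. Only the 20-character label, the stored length, and the length check come from the text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LABEL_SIZE 20 /* label width stated in the text */

/* Hypothetical header: length of the data, then a fixed-width label. */
struct section_header {
  size_t size;
  char label[LABEL_SIZE];
};

/* Write one section: header block followed by the raw bytes of the data. */
int dump_section(FILE *stream, const void *data, size_t size,
                 const char *label) {
  assert(stream != NULL && data != NULL && label != NULL);
  struct section_header head;
  memset(&head, 0, sizeof(head)); /* zero padding and unused label bytes */
  head.size = size;
  strncpy(head.label, label, LABEL_SIZE - 1); /* always NUL-terminated */
  if (fwrite(&head, sizeof(head), 1, stream) != 1) return -1;
  if (size > 0 && fwrite(data, size, 1, stream) != 1) return -1;
  return 0;
}

/* Read one section. Returns -2 if the stored length does not match the
 * size of the current struct, -1 on I/O failure, 0 on success. */
int restore_section(FILE *stream, void *data, size_t expected,
                    char *label_out) {
  assert(stream != NULL && data != NULL);
  struct section_header head;
  if (fread(&head, sizeof(head), 1, stream) != 1) return -1;
  if (head.size != expected) return -2; /* struct changed since the dump */
  if (label_out != NULL) {
    memcpy(label_out, head.label, LABEL_SIZE);
    label_out[LABEL_SIZE - 1] = '\0';
  }
  if (expected > 0 && fread(data, expected, 1, stream) != 1) return -1;
  return 0;
}
```

Because the header is written as a raw struct, a file produced this way is only readable on the same architecture, which matches the restriction stated earlier. The label is returned but not compared, mirroring the note that labels are currently just symbolic.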
Control of restart files is provided by the parameters:
    Restarts:
      enable:      1        # (Optional) whether to enable dumping restarts at fixed intervals.
      onexit:      0        # (Optional) whether to dump restarts on exit (*needs enable*)
      subdir:      restart  # (Optional) name of subdirectory for restart files.
      basename:    swift    # (Optional) prefix used in naming restart files.
      delta_hours: 6.0      # (Optional) decimal hours between dumps of restart files.
      stop_steps:  100      # (Optional) how many steps to process before checking if the
                            # <subdir>/stop file exists. When present the application will
                            # attempt to exit early, dumping restart files first.
and a restart is activated by the flag
-r (most command-line options are ignored, but
the parameter file is still required, as it says where to find the restart
files). Output files are appended to on restart; this will not always make
sense, as the restart state can date from some time before the application exited.
SWIFT can now be stopped by touching the file
stop in the restarts directory.
This will dump the restart files, regardless of the onexit setting, although
if enable is false, that will be respected. The presence of this file is only
checked every stop_steps steps (currently every 100).
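One possible way to poll for the stop file is sketched below. The helper names `stop_file_exists` and `should_stop` are hypothetical, and the use of POSIX `stat()` is an assumption about how the check could be done, not a description of the existing code. The sketch only shows the idea of touching the filesystem once every stop_steps steps.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if <subdir>/stop exists, 0 otherwise. */
int stop_file_exists(const char *subdir) {
  assert(subdir != NULL);
  char path[4096];
  int n = snprintf(path, sizeof(path), "%s/stop", subdir);
  if (n < 0 || (size_t)n >= sizeof(path)) return 0; /* path too long */
  struct stat buf;
  return stat(path, &buf) == 0;
}

/* Hypothetical helper: only look for the file every stop_steps steps,
 * so that the filesystem is not queried on every step. */
int should_stop(const char *subdir, int step, int stop_steps) {
  if (stop_steps <= 0 || step % stop_steps != 0) return 0;
  return stop_file_exists(subdir);
}
```

With stop_steps set to 100, a stop file touched at step 150 would be noticed at step 200. In an MPI run the result would presumably also need to be agreed between ranks before exiting, which this sketch does not attempt.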
Jobs to do:
- keep backup copies of the restart files
- decide how to poll the stop file.