Runtime limit and resubmission command
Implements #461 (closed).
We add a new parameter that sets the maximal wall-clock time of a given run. When that time is reached, the code dumps a restart file and exits. An additional set of parameters allows the user to specify (or not) a command to be run as SWIFT exits. This is, for instance, a convenient way of having the code re-submit itself to the batch queue and resume the run without human intervention.
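The run-time limit described above amounts to comparing elapsed wall-clock time against the configured number of hours. A minimal sketch of that check follows. The helper name `run_time_limit_reached` and the convention that a non-positive limit means "no limit" are assumptions made for illustration, not taken from the SWIFT sources.

```c
#include <time.h>

/* Hypothetical helper (not SWIFT's actual code): returns 1 once the
 * wall-clock time elapsed between `start` and `now` has reached
 * `max_run_time_hours`, and 0 otherwise. A non-positive limit is treated
 * as "no limit", which is an assumption of this sketch. */
int run_time_limit_reached(const struct timespec *start,
                           const struct timespec *now,
                           double max_run_time_hours) {
  if (max_run_time_hours <= 0.0) return 0;

  /* Elapsed time in seconds, including the nanosecond part. */
  const double elapsed_s = (double)(now->tv_sec - start->tv_sec) +
                           1e-9 * (double)(now->tv_nsec - start->tv_nsec);

  return elapsed_s >= max_run_time_hours * 3600.0;
}
```

In a time-stepping loop, one would presumably record `start` once with `clock_gettime(CLOCK_MONOTONIC, &start)`, refresh `now` every step, and trigger the restart dump and exit when the helper returns 1.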
The exit command is run with the system() call, which performs no check on what the user specified as a command. Do you see a better way of achieving this?
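For reference, a minimal sketch of what running a user-supplied command through `system()` could look like. The helper name `run_exit_command` is hypothetical, and flushing the output streams before the call is an assumption of this sketch rather than something the merge request is known to do. The string is handed to the shell unchecked, which is exactly the concern raised here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper (not SWIFT's actual code): run `command` through the
 * shell with system(). Returns the command's exit code, or -1 if the
 * command is empty, could not be started, or did not exit normally.
 * No validation is done on the string itself. */
int run_exit_command(const char *command) {
  if (command == NULL || command[0] == '\0') return -1;

  /* Flush our own buffered output so it is written before the child's. */
  fflush(NULL);

  const int status = system(command);
  if (status == -1 || !WIFEXITED(status)) return -1;

  return WEXITSTATUS(status);
}
```

A caller could log the returned code and otherwise ignore it, since the run is exiting in any case.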
Activity
  enable: 1 # (Optional) whether to enable dumping restarts at fixed intervals.
  save: 1 # (Optional) whether to save copies of the previous set of restart files (named .prev)
  onexit: 0 # (Optional) whether to dump restarts on exit (*needs enable*)
  subdir: restart # (Optional) name of subdirectory for restart files.
  basename: swift # (Optional) prefix used in naming restart files.
  delta_hours: 6.0 # (Optional) decimal hours between dumps of restart files.
  stop_steps: 100 # (Optional) how many steps to process before checking if the <subdir>/stop file exists. When present the application will attempt to exit early, dumping restart files first.
  max_run_time: 24.0 # (optional) Maximal wall-clock time in hours. The application will exit when this limit is reached.
  resubmit_on_exit: 0 # (Optional) whether to run a command when exiting after the time limit has been reached.
changed this line in version 2 of the diff
added 1 commit
- 8c7c9241 - Read the resubmission command from the YAML file.
There was indeed a missing commit+push...
I do not think we need to worry about what the command might return. We do not want SWIFT to do anything about it anyway. My only worry is that we would have system() running un-verified input on the nodes, which admins may not be happy about. But at the same time it is user space, so there is only so much damage that can be done.
Indeed, there is nothing we could do about that. In principle the submission script can run any command, including re-submitting. I guess you don't want to just use that option because the state of the simulation is unknown, i.e. whether this is a final exit or one that needs to be resubmitted. You could imagine using a rendezvous file for that and avoiding this code in SWIFT.
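The rendezvous-file idea could look roughly like this on the SWIFT side: when the code exits because the time limit was reached (rather than because the run finished), it leaves a small marker file behind. The helper name `write_rendezvous_file`, the file contents, and any path passed to it are assumptions made for illustration only.

```c
#include <stdio.h>

/* Hypothetical helper (not SWIFT's actual code): create a marker file at
 * `path` to signal that the run stopped early and should be resubmitted.
 * Returns 0 on success and -1 if the file could not be written. */
int write_rendezvous_file(const char *path) {
  FILE *f = fopen(path, "w");
  if (f == NULL) return -1;

  /* The contents are arbitrary; the file's existence is the signal. */
  if (fputs("resubmit\n", f) == EOF) {
    fclose(f);
    return -1;
  }

  return fclose(f) == 0 ? 0 : -1;
}
```

The batch script would then test for the marker once SWIFT has exited, delete it, and resubmit only if it was present, so no command needs to be executed from inside SWIFT.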
Ok. I am happy to remove it. I can use the batch scripts themselves to re-submit. The thing I need most is the runtime limit.
Edited by Matthieu Schaller
It all seems to be working. Maybe we should ask the arbiter of taste (@nnrw56) if this is acceptable or not?
added 1 commit
- 109cf456 - Add warning that command is a security concern, make sure output is not mixed…
mentioned in commit 1be34da0