Runtime limit and resubmission command

Merged Matthieu Schaller requested to merge resubmission_command into master
1 unresolved thread

Implements #461 (closed).

We create a new parameter to set the maximal wall-clock time of a given run. When that time is reached, the code dumps a restart file and exits. An additional set of parameters allows the user to specify (or not) a command to be run as SWIFT exits. This is, for instance, a convenient way of having the code re-submit itself to the batch queue and resume the run without human intervention.

This uses the system() command and makes no check on what the user specified as a command. Do you see a better way of achieving this?
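A minimal sketch of the two pieces described above, assuming nothing about how SWIFT actually implements them: the helper names `runtime_limit_reached` and `resubmit_on_exit` are hypothetical, and only the parameter names `max_run_time` and `resubmit_on_exit` come from the diff. The caller is assumed to dump restart files between the two calls.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not SWIFT's actual code): returns 1 once
 * `max_run_time_hours` of wall-clock time have elapsed between `start`
 * and `now`, and 0 otherwise. */
int runtime_limit_reached(time_t start, time_t now, double max_run_time_hours) {
  const double elapsed_hours = difftime(now, start) / 3600.0;
  return elapsed_hours >= max_run_time_hours;
}

/* Hypothetical helper: runs the user-supplied command through system()
 * when resubmission is switched on. The command is passed to the shell
 * unchecked, which is exactly the concern raised in the description.
 * Returns -1 if nothing was run, otherwise the raw system() result. */
int resubmit_on_exit(int resubmit, const char *command) {
  if (!resubmit || command == NULL || command[0] == '\0') return -1;
  fflush(NULL); /* Flush our own buffered output before the child writes. */
  return system(command);
}
```

In a time-step loop, the first helper would be called once per step with `time(NULL)` as `now`; when it returns 1, the code would write the restart files, call the second helper, and exit.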

Merged by avatar (May 30, 2025 5:15pm UTC)

Activity

enable: 1 # (Optional) whether to enable dumping restarts at fixed intervals.
save: 1 # (Optional) whether to save copies of the previous set of restart files (named .prev)
onexit: 0 # (Optional) whether to dump restarts on exit (*needs enable*)
subdir: restart # (Optional) name of subdirectory for restart files.
basename: swift # (Optional) prefix used in naming restart files.
delta_hours: 6.0 # (Optional) decimal hours between dumps of restart files.
stop_steps: 100 # (Optional) how many steps to process before checking if the <subdir>/stop file exists. When present the application will attempt to exit early, dumping restart files first.
max_run_time: 24.0 # (optional) Maximal wall-clock time in hours. The application will exit when this limit is reached.
resubmit_on_exit: 0 # (Optional) whether to run a command when exiting after the time limit has been reached.
  • system() should be fine here, unless you want to handle any terminal output and just report a status, to avoid mess.

    In that case we would need to use popen() (or fork() and numerous dup() calls), or change the command so that its output is redirected; it is just a shell command after all.
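The popen() alternative mentioned in this comment could look roughly like the sketch below. The function name `run_command_quietly` is hypothetical and this is not what the merge request implements. It assumes a POSIX system and only intercepts the command's standard output; standard error would still reach the terminal unless the command itself redirects it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: runs `command` through the shell with popen(),
 * reads and discards everything it writes to standard output, and
 * returns its exit status, or -1 if it could not be run or did not
 * exit normally. */
int run_command_quietly(const char *command) {
  FILE *pipe = popen(command, "r");
  if (pipe == NULL) return -1;

  /* Drain the pipe so the child's output never reaches our terminal. */
  char buffer[256];
  while (fgets(buffer, sizeof(buffer), pipe) != NULL) {
  }

  const int status = pclose(pipe);
  if (status == -1 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

The caller can then print a single status line instead of whatever the command produced.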

  • added 1 commit

    • 8c7c9241 - Read the resubmission command from the YAML file.

  • There was indeed a missing commit+push...

    I do not think we need to worry about what the command might return. We do not want SWIFT to do anything about it anyway. My only worry is that we would have system() running unverified input on the nodes, which admins may not be happy about. But at the same time it is user space, so there is only so much damage that can be done.

  • Indeed, there is nothing we could do about that. In principle the submission script can run any command, including re-submitting. I guess you don't want to just use that option because the state of the simulation is unknown to the script, i.e. whether this is a finished run or one that needs to be resubmitted. You could imagine using a rendezvous file for that and avoiding this code in SWIFT.

  • One other option is to have SWIFT return a different exit code in that situation that can be captured by the batch script. This will then act upon this result.

    I just feel that this will be too complicated for the average user...
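The exit-code option could be sketched as follows. The value 3 and the names `EXIT_RESUBMIT` and `choose_exit_code` are purely illustrative assumptions; nothing in this thread says which code SWIFT would return.

```c
#include <stdlib.h>

/* Hypothetical exit code signalling "stopped at the wall-clock limit,
 * restart files written, please resubmit". The value is arbitrary. */
enum { EXIT_RESUBMIT = 3 };

/* Hypothetical helper: picks the process exit code. Errors take
 * precedence, then the time-limit case, then a normal finish. */
int choose_exit_code(int error, int time_limit_reached) {
  if (error) return EXIT_FAILURE;
  if (time_limit_reached) return EXIT_RESUBMIT;
  return EXIT_SUCCESS;
}
```

The batch script would then test the exit status of the run and resubmit itself only when it sees the dedicated value, which is the extra step the comment above worries is too much to ask of users.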

  • Agreed about exit codes. I only expect to see them on error and handling them is probably beyond our average user.

  • We use a similar system in Gadget and I have seen other codes do the same thing. Apart from the rendezvous technique you mention, I think this is the least bad option.

  • Was there anything to add here?

  • Haven't found a moment to test things, so not yet. Still uncomfortable calling system() in SWIFT, but needs must.

  • Ok. I am happy to remove it. I can use the batch scripts themselves to re-submit. The thing I need most is the runtime limit.

    Edited by Matthieu Schaller
  • It all seems to be working. Maybe we should ask the arbiter of taste (@nnrw56) if this is acceptable or not?

  • added 1 commit

    • 109cf456 - Add warning that command is a security concern, make sure output is not mixed…

  • OK, so we are accepting this. I've added a note about security and tried to make sure that the output does not mix in with that from the MPI process.

  • Peter W. Draper mentioned in commit 1be34da0

