WIP: Add handler for USR1 signal that attempts to dump the active tasks and memory…
Add a handler for the USR1 signal that attempts to dump the active tasks and the memory logger (if enabled), then exits the application.
Attempt to provide some useful information about internal state when deadlocked.
Signaling a SLURM batch job using scancel -s USR1 <jobid> works (but not reliably).
See !859 (merged) for a more reliable solution.
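As a rough illustration of the idea described above (not the code in this merge request), here is a minimal Python sketch of a USR1 handler that dumps some state and then exits. The names ACTIVE_TASKS, dump_state, on_usr1 and install_handler are hypothetical placeholders; the real application keeps its own task and memory logger structures.

```python
import os
import signal
import sys

# Hypothetical stand-in for the application's internal state.
ACTIVE_TASKS = ["task_a", "task_b"]


def dump_state(stream=sys.stderr):
    """Write the (hypothetical) active task list to the given stream."""
    for name in ACTIVE_TASKS:
        stream.write(f"active task: {name}\n")
    stream.flush()


def on_usr1(signum, frame):
    """Dump state, then exit immediately without normal cleanup."""
    dump_state()
    # os._exit skips atexit handlers, which might themselves block
    # if the process is deadlocked.
    os._exit(1)


def install_handler():
    """Register on_usr1 for SIGUSR1; must be called from the main thread."""
    signal.signal(signal.SIGUSR1, on_usr1)
```

Calling install_handler() once at startup is enough. Note that Python only runs signal handlers when the interpreter regains control in the main thread, so a process stuck inside a blocking native call may never reach the handler.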
Activity
James tried playing around with this some time ago (when we had a deadlock on Stampede) and it turned out to be tricky to reliably trap signals from the batch system, like SIGTERM, so I'm just attempting to avoid any cleverness and keeping away from signals that other systems might decide to mediate.
We'd still need to use scancel -s INT <jobid>, but I take the point that SIGINT is probably better, as in Control-C. Sadly this code has bigger issues as it failed to work on a deadlocked job I'm seeing!
added 1 commit
- 4b3b4a7a - Only use one MPI_Barrier per process, that is the size of our world
Turns out that this method is never going to work reliably for MPI tasks. Having the batch system, the script running the task and the MPI launcher all in play gives too many actors that can and do intercept the signal. Attempting a simpler mechanism in !859 (merged) and closing this request.
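To make the "too many actors" point concrete, here is a hedged sketch of what just one of those layers, the script running the task, would have to do for the signal to reach the application at all. The function run_and_forward is a hypothetical helper written for illustration; it is not part of this merge request or of !859, and the batch system and MPI launcher would each need to cooperate in a similar way.

```python
import signal
import subprocess


def run_and_forward(cmd, signals=(signal.SIGUSR1,)):
    """Run cmd as a child process and relay the listed signals to it.

    Returns the child's exit code.
    """
    child = subprocess.Popen(cmd)

    def relay(signum, frame):
        # Pass the signal on to the child instead of acting on it here.
        child.send_signal(signum)

    # Remember the previous handlers so they can be restored afterwards.
    previous = {s: signal.signal(s, relay) for s in signals}
    try:
        return child.wait()
    finally:
        for s, handler in previous.items():
            if handler is not None:
                signal.signal(s, handler)
```

Even with a wrapper like this in place, delivery still depends on how the batch system and the launcher treat the signal, which matches the unreliability reported here.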
mentioned in merge request !859 (merged)