Deadlock detector
I've been working on this as part of trying to solve issues with !1719 (merged) .
The idea is to allow swift to detect deadlocks. The scheduler keeps track of the time the last task was successfully fetched from a queue. If a user specified amount of time passes since the last successful fetch (tens of seconds, minutes...), swift assumes we're in a deadlock and aborts the mission.
I'm trying to use it to debug the issues in !1719 (merged). I thought this might be useful for others too, so here we are.
Marked as draft since it needs more testing on bigger examples, but it should be ready for a review.
I've prepared a second git branch with a ready-to-go example, where I manually tweaked cell_unskip.c to produce a deadlock.
How to run the example:
$ git checkout scheduler_timeout-proof-of-concept # note that this is not the same branch as in the MR
$ ./configure --with-rt=debug --with-stars=basic --with-feedback=none --enable-debugging-checks
$ make -j
$ cd examples/RadiativeTransferTests/UniformBox_3D
$ ./run.sh -m # -m flag executes with MPI
The deadlock should occur on step 3, after 10s of inactivity. The output should look something like this:
[0000] [00000.0] main: Running on 64000 gas particles, 0 sink particles, 8000 stars particles 0 black hole particles, 0 neutrino particles, and 0 DM particles (72000 gravity particles)
[0000] [00000.0] main: from t=0.000e+00 until t=9.537e-07 with 2 ranks, 4 threads / rank and 4 task queues / rank (dt_min=1.000e-12, dt_max=5.000e-08)...
[0000] [00000.1] engine_init_particles: Setting particles to a valid state...
[0000] [00000.1] engine_init_particles: Computing initial gas densities and approximate gravity.
[0000] [00000.1] space_rebuild: (re)building space
[0000] [00000.2] engine_init_particles: Converting internal energy variable.
[0000] [00000.2] engine_init_particles: Running initial fake time-step.
[0000] [00000.2] space_rebuild: (re)building space
0 cycle 0 (during regular tasks) dt= 7.450580e-09 min/max active bin= 1/56 rt_updates= 64000
0 cycle 1 time= 7.450580e-09 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
0 cycle 2 time= 1.490116e-08 dt= 7.450580e-09 min/max active bin=49/50 rt_updates= 64000
0 cycle 3 time= 2.235174e-08 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
[0000] [00000.6] engine_dump_snapshot: Dumping snapshot at t=0.000000e+00
[0000] [00000.6] engine_print_stats: Saving statistics at t=0.000000e+00
# Step Time Scale-factor Redshift Time-step Time-bins Updates g-Updates s-Updates sink-Updates b-Updates Wall-clock time [ms] Props Dead time [ms]
0 0.000000e+00 1.0000000 0.0000000 0.000000e+00 1 56 64000 72000 8000 0 0 526.688 27 34.313
[0000] [00000.6] space_rebuild: (re)building space
1 cycle 0 (during regular tasks) dt= 7.450580e-09 min/max active bin=49/51 rt_updates= 64000
1 cycle 1 time= 3.725290e-08 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
1 cycle 2 time= 4.470348e-08 dt= 7.450580e-09 min/max active bin=49/50 rt_updates= 64000
1 cycle 3 time= 5.215406e-08 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
[0000] [00001.0] engine_print_stats: Saving statistics at t=2.980232e-08
[0000] [00001.0] engine_dump_snapshot: Dumping snapshot at t=2.980232e-08
1 2.980232e-08 1.0000000 0.0000000 2.980232e-08 51 51 64000 72000 8000 0 0 414.871 25 33.973
2 cycle 0 (during regular tasks) dt= 7.450580e-09 min/max active bin=49/52 rt_updates= 64000
2 cycle 1 time= 6.705522e-08 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
2 cycle 2 time= 7.450580e-08 dt= 7.450580e-09 min/max active bin=49/50 rt_updates= 64000
2 cycle 3 time= 8.195638e-08 dt= 7.450580e-09 min/max active bin=49/49 rt_updates= 64000
[0000] [00001.4] engine_print_stats: Saving statistics at t=5.960464e-08
[0000] [00001.4] engine_dump_snapshot: Dumping snapshot at t=5.960464e-08
2 5.960464e-08 1.0000000 0.0000000 2.980232e-08 51 52 64000 72000 8000 0 0 384.540 24 31.060
[0000] [00001.4] engine_repartition: repartitioning space
[0000] [00001.4] check_weights: checking vertex weight consistency
[0000] [00001.4] check_weights: checking edge weight consistency
[0000] [00001.4] check_weights: partition weights checked successfully
[0000] [00001.4] space_rebuild: (re)building space
3 cycle 0 (during regular tasks) dt= 7.450580e-09 min/max active bin=49/51 rt_updates= 64000
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000 ms of no new task being fetched from queues. Dumping queues.
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000 ms of no new task being fetched from queues. Dumping queues.
[0001] [00011.7] scheduler.c:scheduler_abort_deadlock():2901: Aborting now.
[0001] [00011.7] scheduler.c:scheduler_abort_deadlock():2901: Aborting now.
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000.2 ms of no new task being fetched from queues. Dumping queues.
Merge request reports
Activity
requested review from @matthieu
added 2 commits
@pdraper you mentioned that the extra thread listening to the user could be used as a way to achieve something similar. The difference here is that it's somewhat automatic. Do you see a way to combine both approaches?
Not really, the thing I was pointing up is the range of things that are already being dumped in the dumper code:
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/blob/master/src/engine_config.c#L65
We could turn that into a function and call from both approaches, but equally this could remove the need for the dumper, so we should strip out that.
added 23 commits
-
7457152d...4d649a57 - 15 commits from branch
master
- 858eea38 - initial backup commit
- 3867cb1f - seems to work
- 24ecc9ad - removed scheduler lock
- 86043d4b - applied code formatting tool
- b6bd4f6e - bit of cleanup
- cd835cce - added yml parameter for deadlock waiting time
- 72eb00a0 - revert cell_unskip.c to master version
- d0723485 - remove deprecated comment
Toggle commit list-
7457152d...4d649a57 - 15 commits from branch
assigned to @matthieu
added 1 commit
- d3ffd447 - unified output of dumper thread and deadlock detector
I think this is ready to go now.
@pdraper 's point to call the same diagnostics data dumper function has been realised. I also updated the
scheduler_timeout-proof-of-concept
branch if you want to see a demonstration.Currently running EAGLE_50 with 32 MPI ranks. Doesn't look like this would break things even when turned on correctly. I've made sure that unless it's activated via the parameter file, this should not break anything.