Skip to content

Deadlock detector

Mladen Ivkovic requested to merge scheduler_timeout into master

I've been working on this as part of trying to solve issues with !1719 (merged) .

The idea is to allow swift to detect deadlocks. The scheduler keeps track of the time the last task was successfully fetched from a queue. If a user specified amount of time passes since the last successful fetch (tens of seconds, minutes...), swift assumes we're in a deadlock and aborts the mission.

I'm trying to use it to debug the issues in !1719 (merged). I thought this might be useful for others too, so here we are.

Marked as draft since it needs more testing on bigger examples, but it should be ready for a review.

I've prepared a second git branch with a ready-to-go example, where I manually tweaked cell_unskip.c to produce a deadlock.

How to run the example:

$ git checkout scheduler_timeout-proof-of-concept # note that this is not the same branch as in the MR
$ ./configure --with-rt=debug --with-stars=basic --with-feedback=none --enable-debugging-checks
$ make -j
$ cd examples/RadiativeTransferTests/UniformBox_3D
$ ./run.sh -m  # -m flag executes with MPI

The deadlock should occur on step 3, after 10s of inactivity. The output should look something like this:

[0000] [00000.0] main: Running on 64000 gas particles, 0 sink particles, 8000 stars particles 0 black hole particles, 0 neutrino particles, and 0 DM particles (72000 gravity particles)
[0000] [00000.0] main: from t=0.000e+00 until t=9.537e-07 with 2 ranks, 4 threads / rank and 4 task queues / rank (dt_min=1.000e-12, dt_max=5.000e-08)...
[0000] [00000.1] engine_init_particles: Setting particles to a valid state...
[0000] [00000.1] engine_init_particles: Computing initial gas densities and approximate gravity.
[0000] [00000.1] space_rebuild: (re)building space
[0000] [00000.2] engine_init_particles: Converting internal energy variable.
[0000] [00000.2] engine_init_particles: Running initial fake time-step.
[0000] [00000.2] space_rebuild: (re)building space
       0 cycle   0 (during regular tasks) dt=  7.450580e-09 min/max active bin= 1/56 rt_updates=             64000
       0 cycle   1 time= 7.450580e-09     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
       0 cycle   2 time= 1.490116e-08     dt=  7.450580e-09 min/max active bin=49/50 rt_updates=             64000
       0 cycle   3 time= 2.235174e-08     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
[0000] [00000.6] engine_dump_snapshot: Dumping snapshot at t=0.000000e+00
[0000] [00000.6] engine_print_stats: Saving statistics at t=0.000000e+00
#   Step           Time Scale-factor     Redshift      Time-step Time-bins      Updates    g-Updates    s-Updates sink-Updates    b-Updates  Wall-clock time [ms]  Props    Dead time [ms] 
       0   0.000000e+00    1.0000000    0.0000000   0.000000e+00    1   56        64000        72000         8000            0            0               526.688     27            34.313
[0000] [00000.6] space_rebuild: (re)building space
       1 cycle   0 (during regular tasks) dt=  7.450580e-09 min/max active bin=49/51 rt_updates=             64000
       1 cycle   1 time= 3.725290e-08     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
       1 cycle   2 time= 4.470348e-08     dt=  7.450580e-09 min/max active bin=49/50 rt_updates=             64000
       1 cycle   3 time= 5.215406e-08     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
[0000] [00001.0] engine_print_stats: Saving statistics at t=2.980232e-08
[0000] [00001.0] engine_dump_snapshot: Dumping snapshot at t=2.980232e-08
       1   2.980232e-08    1.0000000    0.0000000   2.980232e-08   51   51        64000        72000         8000            0            0               414.871     25            33.973
       2 cycle   0 (during regular tasks) dt=  7.450580e-09 min/max active bin=49/52 rt_updates=             64000
       2 cycle   1 time= 6.705522e-08     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
       2 cycle   2 time= 7.450580e-08     dt=  7.450580e-09 min/max active bin=49/50 rt_updates=             64000
       2 cycle   3 time= 8.195638e-08     dt=  7.450580e-09 min/max active bin=49/49 rt_updates=             64000
[0000] [00001.4] engine_print_stats: Saving statistics at t=5.960464e-08
[0000] [00001.4] engine_dump_snapshot: Dumping snapshot at t=5.960464e-08
       2   5.960464e-08    1.0000000    0.0000000   2.980232e-08   51   52        64000        72000         8000            0            0               384.540     24            31.060
[0000] [00001.4] engine_repartition: repartitioning space
[0000] [00001.4] check_weights: checking vertex weight consistency
[0000] [00001.4] check_weights: checking edge weight consistency
[0000] [00001.4] check_weights: partition weights checked successfully
[0000] [00001.4] space_rebuild: (re)building space
       3 cycle   0 (during regular tasks) dt=  7.450580e-09 min/max active bin=49/51 rt_updates=             64000
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000 ms of no new task being fetched from queues. Dumping queues.
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000 ms of no new task being fetched from queues. Dumping queues.
[0001] [00011.7] scheduler.c:scheduler_abort_deadlock():2901: Aborting now.
[0001] [00011.7] scheduler.c:scheduler_abort_deadlock():2901: Aborting now.
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[0001] [00011.7] scheduler_abort_deadlock: Detected what looks like a deadlock after 10000.2 ms of no new task being fetched from queues. Dumping queues.

Merge request reports