Tasks cleanup
Replace all tasks that are not part of the main computation to a threadpool
which can more simply and efficiently schedule small tasks with no dependencies. See #166 (closed).
Peter, can you check that this effectively passes all your tests? Thanks!
Merge request reports
Activity
MPI isn't working. The simple test:
> mpirun -np 2 ../swift_mpi -s -t 2 -v 2 sodShock.yml
fails with:
2 2.442002e-04 5.960464e-08 821962 0 2180.446 [0000] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00). [0001] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00). [0000] [00008.3] space.c:space_regrid():204: Must have at least 3 cells in each spatial dimension when periodicity is switched on.
Running this several times gives different results each time (
h_max
is not alwaysinf
, usually just quite large).Looks like the issue was introduced by commit 22808bfa. That seems innocuous, but once in place the time-steps no longer seem consistent:
# Step Time Time-step Updates g-Updates Wall-clock time [ms] 0 0.000000e+00 0.000000e+00 0 0 1595.127 1 2.441406e-04 2.441406e-04 1024128 0 2097.195 2 4.882812e-04 2.441406e-04 821962 0 1949.297 3 7.324219e-04 2.441406e-04 1024128 0 2315.323 4 9.765625e-04 2.441406e-04 825004 0 1190.018 5 1.220703e-03 2.441406e-04 1024128 0 1947.241 6 1.220763e-03 5.960464e-08 828121 0 1965.682 7 1.220822e-03 5.960464e-08 10 0 542.785 8 1.220942e-03 1.192093e-07 16 0 114.159 9 1.221180e-03 2.384186e-07 22 0 121.863 10 1.221657e-03 4.768372e-07 28 0 118.910 11 1.222610e-03 9.536743e-07 29 0 116.603 12 1.224518e-03 1.907349e-06 37 0 112.650 13 1.228333e-03 3.814697e-06 39 0 135.359 14 1.235962e-03 7.629395e-06 43 0 145.559 15 1.251221e-03 1.525879e-05 44 0 132.389 16 1.281738e-03 3.051758e-05 45 0 117.968 17 1.342773e-03 6.103516e-05 45 0 124.367 18 1.464844e-03 1.220703e-04 46 0 120.631 19 1.464903e-03 5.960464e-08 1024128 0 1988.082 [0001] [00019.5] runner_do_ghost: Smoothing length failed to converge on 18 particles. 20 1.464963e-03 5.960464e-08 35 0 565.134
compared to same test and code with that commit reverted:
# Step Time Time-step Updates g-Updates Wall-clock time [ms] 0 0.000000e+00 0.000000e+00 0 0 1502.746 1 2.441406e-04 2.441406e-04 1024128 0 1926.878 2 4.882812e-04 2.441406e-04 821962 0 1747.267 3 7.324219e-04 2.441406e-04 1024128 0 1517.797 4 9.765625e-04 2.441406e-04 825004 0 1731.234 5 1.220703e-03 2.441406e-04 1024128 0 2462.824 6 1.464844e-03 2.441406e-04 828121 0 1573.883 7 1.586914e-03 1.220703e-04 1024128 0 2991.975 8 1.708984e-03 1.220703e-04 3 0 1216.576 9 1.831055e-03 1.220703e-04 828802 0 570.419 10 1.953125e-03 1.220703e-04 20 0 1802.887 11 2.075195e-03 1.220703e-04 1024128 0 2093.284 12 2.197266e-03 1.220703e-04 106 0 890.479 13 2.319336e-03 1.220703e-04 829390 0 579.430 14 2.441406e-03 1.220703e-04 241 0 1702.580 15 2.563477e-03 1.220703e-04 1024128 0 1312.131 16 2.685547e-03 1.220703e-04 319 0 929.371 17 2.807617e-03 1.220703e-04 829887 0 489.350 18 2.929688e-03 1.220703e-04 344 0 1943.941 19 3.051758e-03 1.220703e-04 1024128 0 1541.485 20 3.173828e-03 1.220703e-04 343 0 751.056
which looks much like the current master. Could the ordering of the tasks be important, particularly with respect to the proxy cells? Ordering looks like the only real change.
Added 1 commit:
- 997e3344 - update links atomically, makes this thread-safe without explicit locking.
OK, this should be fixed now! The problem was unsafe link updates, which is a bit embarrassing as I had noticed the problem in
engine_count_and_link_tasks_mapper
, but had failed to fix similarly avoid it inengine_make_extra_hydroloop_tasks_mapper
.Fixed it now the right way by making the operation atomic and not relying on explicitly locking a cell.
Added 9 commits:
-
8fb1bdf0...778c4cca - 8 commits from branch
master
- eee8f5e0 - Merge remote-tracking branch 'origin' into tasks_cleanup
-
8fb1bdf0...778c4cca - 8 commits from branch
Thanks, that seems to have cured that issue.
There must be other unsafe accesses in other code as I've seen a couple of random crashes:
19 1.855469e-06 9.765625e-08 2877 0 153.244 [00017.1] scheduler.c:scheduler_done():1160: Negative wait! ./run.sh: line 10: 25556 Aborted (core dumped) ../swift -s -t 16 cosmoVolume.yml
and
38 2.968750e-06 7.812500e-08 703 0 140.305 [00023.0] runner_doiact.h:runner_dopair1_density():757: Trying to interact unsorted cells.
Both running the standard
run.sh
script on the CosmoVolume. Now I've seen these I've tried to reproduce a fail with a looping test, but to no result so far, it just continues to run. Hmm.Added 1 commit:
- ce3782b9 - made engine_marktasks serial again for the non-fixdt case.
You may see it, if lucky. I ran 300 loops of 40 steps on the CosmoVolume and saw it twice.
Meanwhile I found a bug in master from an earlier commit (!179 (merged) now fixed) cannot rule out it being involved, so I'll run that loop again.
Added 1 commit:
- 4b872bc1 - clean-up scheduler_start, launch the zero-rank tasks with the threadpool.
Added 1 commit:
- 13ff2fb8 - use chunked mappers in the threadpool, make sure that the map function actually …