Tasks cleanup

MPI isn't working. The simple test:

> mpirun -np 2 ../swift_mpi -s -t 2 -v 2 sodShock.yml

fails with:

       2   2.442002e-04   5.960464e-08     821962          0              2180.446
[0000] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00).
[0001] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00).
[0000] [00008.3] space.c:space_regrid():204: Must have at least 3 cells in each spatial dimension when periodicity is switched on.

Running this several times gives different results each time (h_max is not always inf, usually just quite large).

Title changed from Tasks cleanup to WIP: Tasks cleanup

Looks like the issue was introduced by commit 22808bfa. That seems innocuous, but once in place the time-steps no longer seem consistent:

#   Step           Time      Time-step    Updates  g-Updates  Wall-clock time [ms]
       0   0.000000e+00   0.000000e+00          0          0              1595.127
       1   2.441406e-04   2.441406e-04    1024128          0              2097.195
       2   4.882812e-04   2.441406e-04     821962          0              1949.297
       3   7.324219e-04   2.441406e-04    1024128          0              2315.323
       4   9.765625e-04   2.441406e-04     825004          0              1190.018
       5   1.220703e-03   2.441406e-04    1024128          0              1947.241
       6   1.220763e-03   5.960464e-08     828121          0              1965.682
       7   1.220822e-03   5.960464e-08         10          0               542.785
       8   1.220942e-03   1.192093e-07         16          0               114.159
       9   1.221180e-03   2.384186e-07         22          0               121.863
      10   1.221657e-03   4.768372e-07         28          0               118.910
      11   1.222610e-03   9.536743e-07         29          0               116.603
      12   1.224518e-03   1.907349e-06         37          0               112.650
      13   1.228333e-03   3.814697e-06         39          0               135.359
      14   1.235962e-03   7.629395e-06         43          0               145.559
      15   1.251221e-03   1.525879e-05         44          0               132.389
      16   1.281738e-03   3.051758e-05         45          0               117.968
      17   1.342773e-03   6.103516e-05         45          0               124.367
      18   1.464844e-03   1.220703e-04         46          0               120.631
      19   1.464903e-03   5.960464e-08    1024128          0              1988.082
[0001] [00019.5] runner_do_ghost: Smoothing length failed to converge on 18 particles.
      20   1.464963e-03   5.960464e-08         35          0               565.134

compared to same test and code with that commit reverted:

#   Step           Time      Time-step    Updates  g-Updates  Wall-clock time [ms]
       0   0.000000e+00   0.000000e+00          0          0              1502.746
       1   2.441406e-04   2.441406e-04    1024128          0              1926.878
       2   4.882812e-04   2.441406e-04     821962          0              1747.267
       3   7.324219e-04   2.441406e-04    1024128          0              1517.797
       4   9.765625e-04   2.441406e-04     825004          0              1731.234
       5   1.220703e-03   2.441406e-04    1024128          0              2462.824
       6   1.464844e-03   2.441406e-04     828121          0              1573.883
       7   1.586914e-03   1.220703e-04    1024128          0              2991.975
       8   1.708984e-03   1.220703e-04          3          0              1216.576
       9   1.831055e-03   1.220703e-04     828802          0               570.419
      10   1.953125e-03   1.220703e-04         20          0              1802.887
      11   2.075195e-03   1.220703e-04    1024128          0              2093.284
      12   2.197266e-03   1.220703e-04        106          0               890.479
      13   2.319336e-03   1.220703e-04     829390          0               579.430
      14   2.441406e-03   1.220703e-04        241          0              1702.580
      15   2.563477e-03   1.220703e-04    1024128          0              1312.131
      16   2.685547e-03   1.220703e-04        319          0               929.371
      17   2.807617e-03   1.220703e-04     829887          0               489.350
      18   2.929688e-03   1.220703e-04        344          0              1943.941
      19   3.051758e-03   1.220703e-04    1024128          0              1541.485
      20   3.173828e-03   1.220703e-04        343          0               751.056

which looks much like the current master. Could the ordering of the tasks be important, particularly with respect to the proxy cells? Ordering looks like the only real change.

Thanks for narrowing this down! I'll have a closer look tonight, there's probably some implicit dependency I messed-up somewhere.

Added 2 commits:

ca912a0c - re-instate verbosity in engine_marktasks.
8cdd5178 - do not scheduler communication tasks directly in scheduler_start.

Added 1 commit:

997e3344 - update links atomically, makes this thread-safe without explicit locking.

Added 2 commits:

ab7ffe55 - undo skipping the comm tasks, something is amiss...
8fb1bdf0 - fix missing task name, lost during merge with master.

OK, this should be fixed now! The problem was unsafe link updates, which is a bit embarrassing as I had noticed the problem in engine_count_and_link_tasks_mapper, but had failed to fix similarly avoid it in engine_make_extra_hydroloop_tasks_mapper.

Fixed it now the right way by making the operation atomic and not relying on explicitly locking a cell.

Added 9 commits:

8fb1bdf0...778c4cca - 8 commits from branch master
eee8f5e0 - Merge remote-tracking branch 'origin' into tasks_cleanup

Thanks, that seems to have cured that issue.

There must be other unsafe accesses in other code as I've seen a couple of random crashes:

      19   1.855469e-06   9.765625e-08       2877          0               153.244
[00017.1] scheduler.c:scheduler_done():1160: Negative wait!
./run.sh: line 10: 25556 Aborted                 (core dumped) ../swift -s -t 16 cosmoVolume.yml

and

      38   2.968750e-06   7.812500e-08        703          0               140.305
[00023.0] runner_doiact.h:runner_dopair1_density():757: Trying to interact unsorted cells.

Both running the standard run.sh script on the CosmoVolume. Now I've seen these I've tried to reproduce a fail with a looping test, but to no result so far, it just continues to run. Hmm.

Oh, not good. Those are seemingly unrelated failures. I'll see if I can reproduce them locally and see what happens.

Added 1 commit:

ce3782b9 - made engine_marktasks serial again for the non-fixdt case.

You may see it, if lucky. I ran 300 loops of 40 steps on the CosmoVolume and saw it twice.

Meanwhile I found a bug in master from an earlier commit (!179 (merged) now fixed) cannot rule out it being involved, so I'll run that loop again.

Added 1 commit:

4b872bc1 - clean-up scheduler_start, launch the zero-rank tasks with the threadpool.

OK, re-worked some of scheduler_start, to make sure we're not dropping any tasks anywhere. Any luck making it fail again?

Not so far, at run 260 and it's now a no-show. I'll leave it running overnight for good measure.

Sod's law. Just as I wrote the "unsorted cells" error finally popped up. I'll rebuild with your latest commit and start it again.

OK, I managed to reproduce the crash (negative wait) by running with way too many threads (16 on my 2-core laptop). Trying to track it down.

Good, several negative waits came up for me overnight, so definitely still an issue.

Added 1 commit:

13ff2fb8 - use chunked mappers in the threadpool, make sure that the map function actually …

Tasks cleanup

Merge request reports

Activity