Skip to content
Snippets Groups Projects

Tasks cleanup

Merged Pedro Gonnet requested to merge tasks_cleanup into master

Replace all tasks that are not part of the main computation to a threadpool which can more simply and efficiently schedule small tasks with no dependencies. See #166 (closed).

Peter, can you check that this effectively passes all your tests? Thanks!

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • MPI isn't working. The simple test:

    > mpirun -np 2 ../swift_mpi -s -t 2 -v 2 sodShock.yml

    fails with:

           2   2.442002e-04   5.960464e-08     821962          0              2180.446
    [0000] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00).
    [0001] [00008.3] space_regrid: h_max is inf (cell_max=0.000e+00).
    [0000] [00008.3] space.c:space_regrid():204: Must have at least 3 cells in each spatial dimension when periodicity is switched on.

    Running this several times gives different results each time (h_max is not always inf, usually just quite large).

  • Peter W. Draper Title changed from Tasks cleanup to WIP: Tasks cleanup

    Title changed from Tasks cleanup to WIP: Tasks cleanup

  • Looks like the issue was introduced by commit 22808bfa. That seems innocuous, but once in place the time-steps no longer seem consistent:

    #   Step           Time      Time-step    Updates  g-Updates  Wall-clock time [ms]
           0   0.000000e+00   0.000000e+00          0          0              1595.127
           1   2.441406e-04   2.441406e-04    1024128          0              2097.195
           2   4.882812e-04   2.441406e-04     821962          0              1949.297
           3   7.324219e-04   2.441406e-04    1024128          0              2315.323
           4   9.765625e-04   2.441406e-04     825004          0              1190.018
           5   1.220703e-03   2.441406e-04    1024128          0              1947.241
           6   1.220763e-03   5.960464e-08     828121          0              1965.682
           7   1.220822e-03   5.960464e-08         10          0               542.785
           8   1.220942e-03   1.192093e-07         16          0               114.159
           9   1.221180e-03   2.384186e-07         22          0               121.863
          10   1.221657e-03   4.768372e-07         28          0               118.910
          11   1.222610e-03   9.536743e-07         29          0               116.603
          12   1.224518e-03   1.907349e-06         37          0               112.650
          13   1.228333e-03   3.814697e-06         39          0               135.359
          14   1.235962e-03   7.629395e-06         43          0               145.559
          15   1.251221e-03   1.525879e-05         44          0               132.389
          16   1.281738e-03   3.051758e-05         45          0               117.968
          17   1.342773e-03   6.103516e-05         45          0               124.367
          18   1.464844e-03   1.220703e-04         46          0               120.631
          19   1.464903e-03   5.960464e-08    1024128          0              1988.082
    [0001] [00019.5] runner_do_ghost: Smoothing length failed to converge on 18 particles.
          20   1.464963e-03   5.960464e-08         35          0               565.134

    compared to same test and code with that commit reverted:

    #   Step           Time      Time-step    Updates  g-Updates  Wall-clock time [ms]
           0   0.000000e+00   0.000000e+00          0          0              1502.746
           1   2.441406e-04   2.441406e-04    1024128          0              1926.878
           2   4.882812e-04   2.441406e-04     821962          0              1747.267
           3   7.324219e-04   2.441406e-04    1024128          0              1517.797
           4   9.765625e-04   2.441406e-04     825004          0              1731.234
           5   1.220703e-03   2.441406e-04    1024128          0              2462.824
           6   1.464844e-03   2.441406e-04     828121          0              1573.883
           7   1.586914e-03   1.220703e-04    1024128          0              2991.975
           8   1.708984e-03   1.220703e-04          3          0              1216.576
           9   1.831055e-03   1.220703e-04     828802          0               570.419
          10   1.953125e-03   1.220703e-04         20          0              1802.887
          11   2.075195e-03   1.220703e-04    1024128          0              2093.284
          12   2.197266e-03   1.220703e-04        106          0               890.479
          13   2.319336e-03   1.220703e-04     829390          0               579.430
          14   2.441406e-03   1.220703e-04        241          0              1702.580
          15   2.563477e-03   1.220703e-04    1024128          0              1312.131
          16   2.685547e-03   1.220703e-04        319          0               929.371
          17   2.807617e-03   1.220703e-04     829887          0               489.350
          18   2.929688e-03   1.220703e-04        344          0              1943.941
          19   3.051758e-03   1.220703e-04    1024128          0              1541.485
          20   3.173828e-03   1.220703e-04        343          0               751.056

    which looks much like the current master. Could the ordering of the tasks be important, particularly with respect to the proxy cells? Ordering looks like the only real change.

  • Author Developer

    Thanks for narrowing this down! I'll have a closer look tonight, there's probably some implicit dependency I messed-up somewhere.

  • Pedro Gonnet Added 2 commits:

    Added 2 commits:

    • ca912a0c - re-instate verbosity in engine_marktasks.
    • 8cdd5178 - do not scheduler communication tasks directly in scheduler_start.
  • Pedro Gonnet Added 1 commit:

    Added 1 commit:

    • 997e3344 - update links atomically, makes this thread-safe without explicit locking.
  • Pedro Gonnet Added 2 commits:

    Added 2 commits:

    • ab7ffe55 - undo skipping the comm tasks, something is amiss...
    • 8fb1bdf0 - fix missing task name, lost during merge with master.
  • Author Developer

    OK, this should be fixed now! The problem was unsafe link updates, which is a bit embarrassing as I had noticed the problem in engine_count_and_link_tasks_mapper, but had failed to fix similarly avoid it in engine_make_extra_hydroloop_tasks_mapper.

    Fixed it now the right way by making the operation atomic and not relying on explicitly locking a cell.

  • Pedro Gonnet Added 9 commits:

    Added 9 commits:

  • Thanks, that seems to have cured that issue.

    There must be other unsafe accesses in other code as I've seen a couple of random crashes:

          19   1.855469e-06   9.765625e-08       2877          0               153.244
    [00017.1] scheduler.c:scheduler_done():1160: Negative wait!
    ./run.sh: line 10: 25556 Aborted                 (core dumped) ../swift -s -t 16 cosmoVolume.yml

    and

          38   2.968750e-06   7.812500e-08        703          0               140.305
    [00023.0] runner_doiact.h:runner_dopair1_density():757: Trying to interact unsorted cells.
    

    Both running the standard run.sh script on the CosmoVolume. Now I've seen these I've tried to reproduce a fail with a looping test, but to no result so far, it just continues to run. Hmm.

  • Author Developer

    Oh, not good. Those are seemingly unrelated failures. I'll see if I can reproduce them locally and see what happens.

  • Pedro Gonnet Added 1 commit:

    Added 1 commit:

    • ce3782b9 - made engine_marktasks serial again for the non-fixdt case.
  • You may see it, if lucky. I ran 300 loops of 40 steps on the CosmoVolume and saw it twice.

    Meanwhile I found a bug in master from an earlier commit (!179 (merged) now fixed) cannot rule out it being involved, so I'll run that loop again.

  • Pedro Gonnet Added 1 commit:

    Added 1 commit:

    • 4b872bc1 - clean-up scheduler_start, launch the zero-rank tasks with the threadpool.
  • Author Developer

    OK, re-worked some of scheduler_start, to make sure we're not dropping any tasks anywhere. Any luck making it fail again?

  • Not so far, at run 260 and it's now a no-show. I'll leave it running overnight for good measure.

  • Sod's law. Just as I wrote the "unsorted cells" error finally popped up. I'll rebuild with your latest commit and start it again.

  • Author Developer

    OK, I managed to reproduce the crash (negative wait) by running with way too many threads (16 on my 2-core laptop). Trying to track it down.

  • Good, several negative waits came up for me overnight, so definitely still an issue.

  • Pedro Gonnet Added 1 commit:

    Added 1 commit:

    • 13ff2fb8 - use chunked mappers in the threadpool, make sure that the map function actually …
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading