
Mpi periodic gravity

Merged Matthieu Schaller requested to merge mpi_periodic_gravity into master

Here is where I am at with the split periodic gravity calculation over MPI.

The run survives for some steps but can get stuck in reproducible ways. For instance, running mpirun -np 4 swift_mpi -s -S -c -G -t 4 eagle_12.yml -v 1 always gets stuck on step 43. We end up with an unbalanced number of send-recv tasks on that step, with one node having an extra unmatched recv blocking the calculation. Note that this is not directly after a rebuild, but it may involve cells that have not had any action performed on them since a rebuild.

I have looked at the obvious things that would prevent the task activation mechanism from making a correct symmetric decision, but without success so far. I will come back to this in a few days once other commitments have passed. As always, any comments or suggestions are welcome.
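
For context, here is a minimal sketch (not SWIFT code; the function and counter names are hypothetical) of the kind of cross-check that exposes such an imbalance: each rank counts the send and recv tasks it has activated and the totals are compared across all ranks, since globally every activated send must be matched by exactly one recv.

    /* Hypothetical sketch: compare the number of activated send and recv
     * tasks across all ranks. Counter names are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    void check_comm_balance(long long n_sends_activated,
                            long long n_recvs_activated) {

      long long counts[2] = {n_sends_activated, n_recvs_activated};
      long long totals[2] = {0, 0};

      /* Sum the activated sends and recvs over all ranks. */
      MPI_Allreduce(counts, totals, 2, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

      /* An imbalance here is what leaves one node blocked on an
       * unmatched recv. */
      if (totals[0] != totals[1]) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
          fprintf(stderr, "Unbalanced comms: %lld sends vs %lld recvs\n",
                  totals[0], totals[1]);
      }
    }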

Edited by Matthieu Schaller


Activity

  • added 1 commit

    • 073d67f2 - Look for grav_mm tasks that will also need send_ti updates


  • So I found another test that fails:

    mpirun -np 2 ../swift_mpi -s -G -S -t 2 eagle_6.yml

    It failed because a send_ti task was not available for the grav_mm task (in the engine_init_particles phase). That looked simple to fix, so I have pushed a fix.

    We now stop at:

           2   6.103516e-07   1.000000e+00    0.00000   3.051758e-07   41   42          615         4620         3633              2512.215      7
    [0001] [00116.6] runner.c:runner_do_end_force():1761: g-particle (id=4719203974277, type=Gas) did not interact gravitationally with all other gparts gp->num_interacted=1038453, total_gparts=1661079 (local num_gparts=879491)

    which is more difficult to understand.
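
    For reference, the check that fires here has this shape (a stand-alone paraphrase with simplified names, not the actual runner.c code): at the end of the force loop, each g-particle's accumulated interaction counter is compared against the global number of g-parts, so any gravity task that was never activated shows up as a shortfall.

        /* Stand-alone paraphrase of the end-of-force debugging check quoted
         * above; the struct and field names are simplified stand-ins. */
        #include <stdio.h>
        #include <stdlib.h>

        struct toy_gpart {
          long long id;
          long long num_interacted; /* incremented by every gravity task */
        };

        static void end_force_check(const struct toy_gpart *gp,
                                    long long total_gparts) {
          if (gp->num_interacted != total_gparts) {
            fprintf(stderr,
                    "g-particle (id=%lld) did not interact gravitationally "
                    "with all other gparts: num_interacted=%lld, "
                    "total_gparts=%lld\n",
                    gp->id, gp->num_interacted, total_gparts);
            exit(EXIT_FAILURE);
          }
        }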

  • Thanks for reporting this one. I am not sure I agree with the first fix though. The thing I was trying to achieve was to not create send_ti tasks for M-M calculations since these do not require the particles. I'll investigate this EAGLE_6 problem.

  • But don't you need the timestep updates to check if the task should be made active regardless?

    Anyway, I agree that fix looks wrong, as the EAGLE_12 test is now failing in the hydro part of marktasks (EAGLE_6 runs forever without the debugging checks).

    BTW, the failure with the EAGLE_6 check was at:

            /* If the local cell is active, send its ti_end values. */
            if (ci_active_gravity)
              scheduler_activate_send(s, ci->send_ti, cj->nodeID);

    ~line 3767. That ci->send_ti was NULL. Given your reasoning I guess this line should be removed instead?

  • added 1 commit

    • 1cef67c9 - Revert "Look for grav_mm tasks that will also need send_ti updates"


  • added 3 commits

    • f5a50abe - Do not activate communication tasks when unlocking an M-M one.
    • 82e9ddbc - Do not link the M-M tasks in with the other gravity pair tasks.
    • 5bc51ffb - Merge branch 'mpi_periodic_gravity' of gitlab.cosma.dur.ac.uk:swift/swiftsim…


  • So... looks like I had forgotten to hit git push on the last two commits...

    But that still has the original issue.

    The idea, in brief, is to not communicate anything when we have an M-M task. Since we have full knowledge of a neighbouring node's tree, we can use its multipole without having to ask for it. It just means we have to drift their multipole whenever one of our multipoles needs it; this is done when unskipping the tasks. That should, in principle, have no impact on the time-step decision, since that decision relates to particles lower in the tree, which will be communicated if necessary and will have an associated send/recv of the ti_end. A sketch of this unskipping logic is given below.
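
    To make that concrete, here is a minimal sketch of the unskipping step, assuming simplified stand-in types and helpers rather than SWIFT's actual cell machinery:

        /* Hypothetical sketch of the M-M unskip step described above: no
         * send/recv task is ever activated; a foreign multipole is simply
         * flagged for drifting so it can be used directly. */
        #include <stdbool.h>

        struct toy_cell {
          int nodeID;           /* rank that owns this cell */
          bool drift_multipole; /* picked up later by the drift machinery */
        };

        /* Request that a cell's multipole be drifted to the current time. */
        static void activate_multipole_drift(struct toy_cell *c) {
          c->drift_multipole = true;
        }

        /* Unskip an M-M pair: only drifts are activated. Time-step (ti_end)
         * information is untouched, as it travels with the particle
         * sends/recvs activated by the regular pair tasks. */
        static void unskip_grav_mm_pair(struct toy_cell *ci,
                                        struct toy_cell *cj,
                                        int local_nodeID) {
          if (ci->nodeID != local_nodeID) activate_multipole_drift(ci);
          if (cj->nodeID != local_nodeID) activate_multipole_drift(cj);
        }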

  • mpirun -np 2 ../swift_mpi -s -G -S -t 2 eagle_6.yml

    runs smoothly with the missing push...

  • Yes, and EAGLE_12 sticks waiting for unpaired tasks after a number of steps. I can see that now...

    Edited by Peter W. Draper
  • Exactly. There is one recv_gpart activated that does not have a matching send_gpart activated.

  • Although that EAGLE_6 example gets stuck on step 4317.

  • Matthieu Schaller added 42 commits


  • One interesting fact (or not?) is that for the EAGLE_12 box, the rebuild that occurs around 10 steps before the code hangs is entirely due to the condition we now impose to rebuild when more than X% of the g-parts have moved. If I remove that condition, this rebuild is not triggered and we then don't hang on step 43. This is why I mentioned the other day that I thought something was being done incorrectly around rebuild time, which would tie in with the other discussion we are having about proxy exchanges. A sketch of the rebuild condition is given below.
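
    For reference, the condition in question has this shape (a sketch with illustrative names; the threshold X is whatever fraction the run is configured with):

        /* Sketch of the rebuild trigger described above: request a rebuild
         * once more than a given fraction of the g-parts have moved since
         * the last rebuild. Names and threshold are illustrative only. */
        static int gpart_motion_requires_rebuild(long long n_gparts_moved,
                                                 long long n_gparts_total,
                                                 double max_moved_fraction) {
          return (double)n_gparts_moved >
                 max_moved_fraction * (double)n_gparts_total;
        }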

  • So. I am now officially stuck on this one. I can't figure out why one send/recv pair of tasks is not activated in a symmetric way.
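
    One way to narrow this down (and roughly what the tracing commits below do; the helper here is a hypothetical illustration, not the actual code) is to tag each top-level cell with an ID, log every send/recv activation against that ID on each rank, and diff the resulting lists for the stuck step:

        /* Hypothetical tracing helper: one line per activated gravity
         * communication, keyed by cell ID, so the per-rank logs can be
         * sorted and diffed to find the unmatched recv_gpart. */
        #include <stdio.h>

        enum comm_kind { COMM_SEND_GPART, COMM_RECV_GPART };

        static void log_comm_activation(FILE *log, int step, long long cellID,
                                        enum comm_kind kind, int other_rank) {
          fprintf(log, "step=%d cell=%lld %s other_rank=%d\n", step, cellID,
                  kind == COMM_SEND_GPART ? "send_gpart" : "recv_gpart",
                  other_rank);
        }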

  • added 2 commits

    • 24f8a411 - Make sure to reset all the timings when recycling a cell
    • 0b6fc758 - Added a lot of debugging calls to track assymetries between nodes.


  • Things to improve but that do not trigger the bug:

    • proxy construction not using multipoles.
  • added 1 commit

    • 602a1164 - Record also the communications that have been activated between nodes.


  • added 1 commit

    • 20f107e9 - Assign cellID to the top-level cells. Track the send and recvs on step 43.


  • added 1 commit

    • 76587e71 - Traced the bug back to the splitting condition in scheduler_splittask not being…


  • added 1 commit

    • 7a669568 - Removed some of the debugging checks

