MPI periodic gravity
Here is where I am at with the split periodic gravity calculation over MPI.
The run survives for some steps but gets stuck in reproducible ways. For instance, running `mpirun -np 4 swift_mpi -s -S -c -G -t 4 eagle_12.yml -v 1` always gets stuck on step 43.
We end up with an unbalanced number of send/recv tasks on that step, with one node having an extra unmatched recv that blocks the calculation. Note that this is not directly after a rebuild, but it may involve cells that have not had any action performed on them since a rebuild.
I have looked at the obvious things that would prevent the task activation mechanism from making a correct, symmetric decision, but without success so far. I will come back to this in a few days once other commitments have passed. As always, any comments or suggestions are welcome.
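As a rough illustration of the kind of symmetry check that could flag such an unmatched recv before the run hangs, here is a minimal sketch (not code from the branch): it assumes the usual SWIFT scheduler/task layout (`tasks`, `skip`, `ci`/`cj`, `nodeID`), and the helper itself is made up for the example.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the activated send/recv tasks per remote rank and compare the
 * totals across ranks: our recvs from rank j must match rank j's sends
 * to us. Field names are assumed, not taken from the actual source. */
void check_comm_symmetry(const struct scheduler *sched, int nr_nodes,
                         int nodeID, MPI_Comm comm) {

  int *n_send = calloc(nr_nodes, sizeof(int));
  int *n_recv = calloc(nr_nodes, sizeof(int));

  for (int i = 0; i < sched->nr_tasks; i++) {
    const struct task *t = &sched->tasks[i];
    if (t->skip) continue;
    if (t->type == task_type_send) n_send[t->cj->nodeID]++;
    if (t->type == task_type_recv) n_recv[t->ci->nodeID]++;
  }

  /* After the all-to-all, their_send[j] is the number of sends that
   * rank j has activated towards this rank. */
  int *their_send = malloc(nr_nodes * sizeof(int));
  MPI_Alltoall(n_send, 1, MPI_INT, their_send, 1, MPI_INT, comm);

  for (int j = 0; j < nr_nodes; j++)
    if (n_recv[j] != their_send[j])
      fprintf(stderr,
              "Rank %d expects %d recvs from rank %d but only %d sends were "
              "activated there.\n",
              nodeID, n_recv[j], j, their_send[j]);

  free(n_send);
  free(n_recv);
  free(their_send);
}
```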
added 1 commit
- 073d67f2 - Look for grav_mm tasks that will also need send_ti updates
So I found another test that fails:
`mpirun -np 2 ../swift_mpi -s -G -S -t 2 eagle_6.yml`
That failed because a `send_ti` task was not available for the `grav_mm` task (in the engine_init_particles phase). That looked simple to fix, so I have pushed a fix. We now stop at:
2 6.103516e-07 1.000000e+00 0.00000 3.051758e-07 41 42 615 4620 3633 2512.215 7 [0001] [00116.6] runner.c:runner_do_end_force():1761: g-particle (id=4719203974277, type=Gas) did not interact gravitationally with all other gparts gp->num_interacted=1038453, total_gparts=1661079 (local num_gparts=879491)
which is more difficult to understand.
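For context, the failing assertion is the gravity-counter consistency check: under the debugging configuration every gpart accumulates the number of gparts it has interacted with, directly or via multipoles, and at the end of the force loop that counter must equal the total number of gparts in the run. Roughly (field names from memory, so treat them as approximate):

```c
/* End-of-force debugging check, paraphrased: an active gpart must have
 * "seen" every other gpart in the volume, either through direct pair
 * interactions or through multipole (M-M / P-M) contributions. */
for (int k = 0; k < c->gcount; k++) {
  struct gpart *gp = &c->gparts[k];

  if (gpart_is_active(gp, e) && gp->num_interacted != e->total_nr_gparts)
    error("g-particle (id=%lld) did not interact gravitationally with all "
          "other gparts gp->num_interacted=%lld, total_gparts=%lld",
          (long long)gp->id_or_neg_offset, (long long)gp->num_interacted,
          (long long)e->total_nr_gparts);
}
```

So in this case the particle only accumulated ~1.04M of the ~1.66M expected interactions, which suggests some contribution (presumably a foreign multipole or a missing pair task) was never applied to it.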
But don't you need the timestep updates to check if the task should be made active regardless?
Anyway, I agree that fix looks wrong, as the EAGLE_12 test is now failing in the hydro part of marktasks (EAGLE_6 runs forever without the debugging checks).
BTW, the failure with the EAGLE_6 check was at:
/* If the local cell is active, send its ti_end values. */
if (ci_active_gravity) scheduler_activate_send(s, ci->send_ti, cj->nodeID);
at ~line 3767. That `ci->send_ti` was NULL. Given your reasoning, I guess this line should be removed instead?

added 1 commit
- 1cef67c9 - Revert "Look for grav_mm tasks that will also need send_ti updates"
So... it looks like I had forgotten to `git push` the last two commits... But that still has the original issue.
The idea, in brief, is not to communicate anything when we have an M-M task. Since we have full knowledge of a neighbouring node's tree, we can use their multipoles without having to ask for them. It just means we have to drift their multipole whenever any of our multipoles needs it, which is done when unskipping the tasks. That should, in principle, have no impact on the time-step decision, since time-steps relate to particles lower in the tree, which will be communicated if necessary and will have an associated send/recv of the ti_end values.
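A sketch of that unskip logic under those assumptions (not the actual code from the branch; the task pointer and helper names are illustrative):

```c
/* Unskipping a pair M-M task that involves a foreign cell: activate the
 * task, but no send/recv. Each node keeps a copy of its neighbours'
 * multipoles, so we only need to drift the foreign multipole forward in
 * time before it is used. */
static void unskip_grav_mm_pair(struct cell *ci, struct cell *cj,
                                struct scheduler *s, const struct engine *e) {

  /* Nothing to do if neither side is active this step. */
  if (!cell_is_active_gravity(ci, e) && !cell_is_active_gravity(cj, e)) return;

  /* Activate the M-M interaction itself (field name assumed). */
  scheduler_activate(s, ci->grav_mm);

  /* Drift whichever multipole is foreign; no communication is activated. */
  if (ci->nodeID != e->nodeID) cell_drift_multipole(ci, e);
  if (cj->nodeID != e->nodeID) cell_drift_multipole(cj, e);

  /* Particle-level interactions deeper in the tree keep their own pair
   * tasks, which do activate the particle send/recv and the ti_end
   * send/recv, so the time-step decision is unaffected. */
}
```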
Yes, and EAGLE_12 gets stuck waiting for unpaired tasks after a number of steps. I can see that now...
Edited by Peter W. Draper

added 42 commits
- 5bc51ffb...8c9ff9f2 - 41 commits from branch master
- 1de6620c - Merge branch 'master' into mpi_periodic_gravity
One interesting fact (or not?) is that, for the EAGLE_12 box, the rebuild that occurs about 10 steps before the code hangs is entirely due to the condition we now impose to rebuild when more than X% of the g-parts have moved. If I remove that condition, this rebuild is not triggered and we then don't hang on step 43. This is why I mentioned the other day that I thought something was being done incorrectly around rebuild time, which would tie in with the other discussion we are having about proxy exchanges.
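For reference, that rebuild condition is of roughly this form (a sketch only; the counter, the threshold parameter and where the check lives are assumptions, and the actual fraction is left unspecified as in the comment above):

```c
#include <mpi.h>

/* Trigger a rebuild once more than a given fraction of all gparts have
 * moved (by some distance criterion) since the last tree construction.
 * count_moved_local is assumed to be accumulated during the drifts on
 * this rank; rebuild_frac stands in for the "X%" threshold. */
void gravity_check_rebuild(struct engine *e, long long count_moved_local,
                           double rebuild_frac) {

  long long count_moved = count_moved_local;
#ifdef WITH_MPI
  /* All ranks must agree on the decision, so sum the counter globally. */
  MPI_Allreduce(MPI_IN_PLACE, &count_moved, 1, MPI_LONG_LONG, MPI_SUM,
                MPI_COMM_WORLD);
#endif

  if ((double)count_moved > rebuild_frac * (double)e->total_nr_gparts)
    e->forcerebuild = 1;
}
```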
added 1 commit
- 602a1164 - Record also the communications that have been activated between nodes.
added 1 commit
- 20f107e9 - Assign cellID to the top-level cells. Track the send and recvs on step 43.
added 1 commit
- 76587e71 - Traced the bug back to the splitting condition in scheduler_splittask not being…