MPI send/recv fixes for inactive cells
This is open so that we have a live diff of the changes to solve the MPI crisis. Also helpful to keep track of the changes.
Contains all of @nnrw56's, @pdraper's and @matthieu's changes up to this stage.
Also see #256 (closed).
Activity
Added 27 commits:
- 66da9d09...c1f413f1 - 26 commits from branch master
- 54d48ae1 - Merge branch 'master' into new_timeline_mpi
Added 1 commit:
- cb6878b6 - Only check lack of time-step assignment for local particles
Added 1 commit:
- 014cc09e - Restore default check for cells in active.h
Added 1 commit:
- 39da7fd3 - Updated the test in the scheduler to only check for pairs that have an active local cell.
Added 1 commit:
- d87df577 - formatting.
Added 1 commit:
- 6fcd9386 - add a check before sending particles to make sure they've been drifted.
Added 1 commit:
- 4135886c - only send/recv rhos and tis for active cells.
Added 1 commit:
- 52ba1a17 - do the same for engine_marktasks, clean up activation and add more debugging info for tasks.
Added 7 commits:
- 52ba1a17...3264ba07 - 6 commits from branch master
- ec28f9d0 - Merge branch 'master' into new_timeline_mpi
Reassigned to @pdraper
So, it all seems good to me. Did not get other crashes.
Two questions left for @nnrw56:
- Are we keeping the additional task subtypes for send/recv? If so, should we make sure they are used everywhere?
- Do we keep the ti_run property of the tasks? What is it supposed to represent?
I would suggest merging this into master if @pdraper is also happy with it. We can then focus on getting the repartitioning to work.
Added 1 commit:
- 43d8be2d - After repartitioning don't check if all cells have been drifted
So yeah, those are two changes that I decided to push as they were extremely useful:
- The send/recv subtypes are needed so that when debugging we can actually distinguish between `send_xv` and `send_rho` tasks. I've pushed a change to make their use a bit more consistent.
- The `ti_run` field, which is only added when compiled with `SWIFT_DEBUG_CHECKS`, is quite useful for trying to figure out when a task was last executed, i.e. in which time step.
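To make the two points above concrete, here is a minimal sketch of a task that carries a send/recv subtype plus a debug-only record of the step in which it last ran. This is not the actual SWIFT task structure: `toy_task`, `toy_subtype`, `toy_task_mark_run` and `toy_task_describe` are hypothetical names made up for illustration, and `SWIFT_DEBUG_CHECKS` is defined by hand here rather than by the build system.

```c
#include <assert.h>
#include <stdio.h>

/* Defined by hand in this sketch to mimic a debug build. */
#define SWIFT_DEBUG_CHECKS 1

/* Hypothetical subtypes: what a send/recv task carries. */
enum toy_subtype { toy_subtype_xv = 0, toy_subtype_rho, toy_subtype_count };

const char *toy_subtype_names[toy_subtype_count] = {"xv", "rho"};

/* Hypothetical, stripped-down task. */
struct toy_task {
  int is_send;              /* 1 for a send task, 0 for a recv task */
  enum toy_subtype subtype; /* payload carried by the task */
#ifdef SWIFT_DEBUG_CHECKS
  long long ti_run; /* time step in which the task last ran, -1 if never */
#endif
};

/* Record the current step when the task executes (debug builds only). */
void toy_task_mark_run(struct toy_task *t, long long ti_current) {
  assert(t != NULL);
#ifdef SWIFT_DEBUG_CHECKS
  t->ti_run = ti_current;
#else
  (void)ti_current; /* nothing to record in a non-debug build */
#endif
}

/* Write a label such as "send_rho (last run: step 42)" into buf.
 * Returns whatever snprintf returns. */
int toy_task_describe(const struct toy_task *t, char *buf, size_t len) {
  assert(t != NULL && buf != NULL);
  const char *dir = t->is_send ? "send" : "recv";
#ifdef SWIFT_DEBUG_CHECKS
  return snprintf(buf, len, "%s_%s (last run: step %lld)", dir,
                  toy_subtype_names[t->subtype], t->ti_run);
#else
  return snprintf(buf, len, "%s_%s", dir, toy_subtype_names[t->subtype]);
#endif
}
```

With the subtype in the label, a stuck `send_xv` can be told apart from a stuck `send_rho` in a task dump, and the recorded step shows whether the task ran in the current time step at all. Because the extra field sits behind the preprocessor guard, it costs nothing in a production build.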
The problem with repartitioning looks to be fixed by 43d8be2d: we now skip the check that all local cells have been drifted. After repartitioning that no longer holds for the cells, although all particles have been drifted, because we start using cells that were previously inactive (and we exchange particles, not cells). Makes sense to me anyway.
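A rough sketch of that distinction, assuming each cell and each particle carries its own drift time stamp. The types `toy_part` and `toy_cell`, the function `check_drifted` and its `check_cells` flag are all hypothetical and only illustrate the idea; they are not the code in 43d8be2d.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal stand-ins for a particle and a cell. */
struct toy_part {
  long long ti_drift; /* time the particle was last drifted to */
};

struct toy_cell {
  long long ti_drift;     /* time the cell itself was last drifted to */
  struct toy_part *parts; /* particles held by the cell */
  size_t count;           /* number of particles */
};

/* Count drift violations at time ti_current.
 * Particles are always checked. Cells are only checked when check_cells
 * is non-zero, so the caller can pass 0 right after a repartition. */
size_t check_drifted(const struct toy_cell *cells, size_t ncells,
                     long long ti_current, int check_cells) {
  assert(cells != NULL || ncells == 0);
  size_t bad = 0;
  for (size_t i = 0; i < ncells; i++) {
    if (check_cells && cells[i].ti_drift != ti_current) bad++;
    for (size_t k = 0; k < cells[i].count; k++)
      if (cells[i].parts[k].ti_drift != ti_current) bad++;
  }
  return bad;
}
```

A previously inactive cell that has just received up-to-date particles would be reported when `check_cells` is 1 and pass when it is 0, while a particle that really was left behind is reported in both cases.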
Re: longer runs. After 9000+ steps the EAGLE_25 4x12 test is failing with a `smoothing length failed to converge` warning. Trying to catch it in the act.