
MPI send/recv fixes for inactive cells

Merged Matthieu Schaller requested to merge new_timeline_mpi into master

This is open so that we have a live diff of the changes made to solve the MPI crisis. It is also helpful for keeping track of the changes.

Contains all of @nnrw56's, @pdraper's, and @matthieu's changes up to this stage.

Also see #256 (closed).

Merge request reports


Activity

  • Matthieu Schaller Added 27 commits:

  • Matthieu Schaller Added 2 commits:

    • 3e441ee0 - 1 commit from branch master
    • 7ec50a00 - Merge branch 'master' into new_timeline_mpi
  • Matthieu Schaller Added 2 commits:

    • 13444dab - Check that all particles and cells have a time-step assigned at the end of the initialisation step.
    • 08d40fbd - Merge branch 'new_timeline_mpi' of gitlab.cosma.dur.ac.uk:swift/swiftsim into new_timeline_mpi
  • Added 1 commit:

    • cb6878b6 - Only check lack of time-step assignment for local particles
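
    As an illustration of what 13444dab and cb6878b6 together describe, here is a minimal sketch of a time-step check restricted to local cells; the structs and function names are simplified stand-ins, not SWIFT's actual code:

      #include <stdio.h>
      #include <stdlib.h>

      /* Simplified stand-ins for SWIFT's structs, for illustration only. */
      struct part { int time_bin; /* <= 0: no time-step assigned yet */ };
      struct cell { int nodeID; int count; struct part *parts; };
      struct engine { int nodeID; int nr_cells; struct cell *cells; };

      /* After initialisation, every *local* particle must have a
       * time-step. Foreign cells are skipped: their particles are
       * updated by the rank that owns them and may not carry a valid
       * time-step on this rank. */
      void engine_check_timesteps(const struct engine *e) {
        for (int k = 0; k < e->nr_cells; ++k) {
          const struct cell *c = &e->cells[k];
          if (c->nodeID != e->nodeID) continue; /* foreign cell: skip */
          for (int i = 0; i < c->count; ++i)
            if (c->parts[i].time_bin <= 0) {
              fprintf(stderr, "Local particle without a time-step!\n");
              abort();
            }
        }
      }
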
  • Added 1 commit:

    • 014cc09e - Restore default check for cells in active.h
  • Added 1 commit:

    • 39da7fd3 - Updated the test in the scheduler to only check for pairs that have an active local cell.
  • Pedro Gonnet Added 2 commits:

    • e371bec3 - do not unskip cells that have no parts or gparts.
    • 053c85dc - update engine_collect_kick logic to deal with foreign cells that received their ti_end_min.
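
    As a rough sketch of the first of these changes (e371bec3), with hypothetical names rather than SWIFT's actual unskip code:

      /* Illustrative stand-in for a SWIFT cell. */
      struct cell_sketch {
        int count;  /* number of hydro particles (parts)    */
        int gcount; /* number of gravity particles (gparts) */
      };

      /* Returns 1 if the cell's tasks were reactivated, 0 if it was
       * left skipped. A cell with no parts and no gparts has nothing
       * to compute or communicate, so unskipping it would only
       * schedule useless (and, over MPI, harmful) send/recv tasks. */
      int cell_unskip_sketch(struct cell_sketch *c) {
        if (c->count == 0 && c->gcount == 0) return 0;
        /* ... reactivate drifts, sorts, interactions, send/recv ... */
        return 1;
      }
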
  • Pedro Gonnet Added 1 commit:

  • Pedro Gonnet Added 1 commit:

    • 6fcd9386 - add a check before sending particles to make sure they've been drifted.
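
    A sketch of the kind of guard 6fcd9386 describes, using the SWIFT_DEBUG_CHECKS convention mentioned later in this thread; the struct fields here are illustrative stand-ins:

      #include <stdio.h>
      #include <stdlib.h>

      #define SWIFT_DEBUG_CHECKS /* normally set at configure time */

      struct cell_s { long long ti_old; };       /* time of last drift */
      struct engine_s { long long ti_current; };

      /* Called just before enqueueing a send task for cell c: the
       * particles must be drifted to the current time before their
       * positions are shipped to another rank, or the receiver ends
       * up interacting with stale coordinates. */
      void check_cell_drifted(const struct cell_s *c,
                              const struct engine_s *e) {
      #ifdef SWIFT_DEBUG_CHECKS
        if (c->ti_old != e->ti_current) {
          fprintf(stderr,
                  "Sending un-drifted cell (ti_old=%lld, ti_current=%lld)\n",
                  c->ti_old, e->ti_current);
          abort();
        }
      #endif
      }
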
  • Pedro Gonnet Added 1 commit:

    • 4135886c - only send/recv rhos and tis for active cells.
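
    In the same spirit, 4135886c amounts to gating the rho/ti exchanges on cell activity. Sketched below with hypothetical task pointers; cell_is_active() is modelled on the helpers in active.h touched earlier in this branch:

      /* Illustrative stand-ins, not SWIFT's actual structures. */
      struct task_s { int skip; };
      struct cell_a {
        long long ti_end_min;              /* earliest end-of-step time */
        struct task_s *send_rho, *send_ti; /* per-cell MPI tasks        */
      };

      /* A cell is active if some of its particles end their step now. */
      static int cell_is_active(const struct cell_a *c, long long ti_current) {
        return c->ti_end_min == ti_current;
      }

      /* Inactive cells have no updated densities and no new time-steps
       * this step, so exchanging rho/ti for them is wasted work. */
      void activate_sends(struct cell_a *ci, long long ti_current) {
        if (!cell_is_active(ci, ti_current)) return;
        ci->send_rho->skip = 0; /* un-skip: task will run this step */
        ci->send_ti->skip = 0;
      }
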
  • Pedro Gonnet Added 1 commit:

    • 52ba1a17 - do the same for engine_marktasks, clean up activation and add more debugging info for tasks.
  • Matthieu Schaller Added 7 commits:

  • Reassigned to @pdraper

  • Matthieu Schaller Title changed from [WIP] New timeline mpi to MPI send/recv fixes for inactive cells

  • So, it all seems good to me. I did not get any other crashes.

    Two questions left for @nnrw56:

    • Are we keeping the additional task subtypes for send/recv? If so, should we make sure they are used everywhere?
    • Do we keep the ti_run property of the tasks? What is it supposed to represent?

    I would suggest merging this into master if @pdraper is also happy with it. We can then focus on getting the repartitioning to work.

  • Peter W. Draper Added 1 commit:

    • 43d8be2d - After repartitioning don't check if all cells have been drifted
  • So yeah, those are two changes that I decided to push because they were extremely useful:

    • The send/recv subtypes are needed so that, when debugging, we can actually distinguish between send_xv and send_rho tasks. I've pushed a change to make their use a bit more consistent.
    • The ti_run field, which is only added when the code is compiled with SWIFT_DEBUG_CHECKS, is quite useful for figuring out when a task was last executed, i.e. in which time step. A sketch of both follows after this list.
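
    To make both points concrete, here is an illustrative sketch (not SWIFT's verbatim definitions) of distinct send/recv subtypes and a debug-only ti_run field stamped by the runner:

      #define SWIFT_DEBUG_CHECKS /* normally set at configure time */

      /* Distinct subtypes let debug output tell a send_xv apart from
       * a send_rho, which share the same generic "send" task type. */
      enum task_subtypes_sketch {
        task_subtype_none_s,
        task_subtype_xv_s,   /* positions/velocities */
        task_subtype_rho_s,  /* densities            */
        task_subtype_tend_s, /* end-of-step times    */
      };

      struct task_sketch {
        enum task_subtypes_sketch subtype;
      #ifdef SWIFT_DEBUG_CHECKS
        long long ti_run; /* integer time of the step this task last ran in */
      #endif
      };

      /* Stamping the task when it executes makes "when did this task
       * last run?" answerable from a debugger or a task dump. */
      void runner_mark_task(struct task_sketch *t, long long ti_current) {
      #ifdef SWIFT_DEBUG_CHECKS
        t->ti_run = ti_current;
      #else
        (void)t; (void)ti_current;
      #endif
      }
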
  • The problem with repartitioning looks to be fixed by 43d8be2d: basically, we now skip the check that all local cells have been drifted. After repartitioning this isn't true, although all particles have been drifted. This is because we start using cells that were previously inactive (and we exchange particles, not cells). Makes sense to me anyway. A sketch of the change follows below.

    Re: longer runs. After 9000+ steps, the EAGLE_25 4x12 test is failing with a "smoothing length failed to converge" warning. Trying to catch it in the act.
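
    For reference, a hypothetical flag-based sketch of what 43d8be2d does; the just_repartitioned flag and all struct layouts are invented for illustration:

      #include <stdio.h>
      #include <stdlib.h>

      struct cell_r { long long ti_old; };
      struct engine_r {
        int just_repartitioned; /* hypothetical: set by the repartition */
        long long ti_current;
        int nr_cells;
        struct cell_r *cells;
      };

      /* Debug check that every local cell has been drifted to the
       * current time. Skipped right after a repartition: since only
       * particles (not cells) are exchanged, a rank can now own
       * previously-inactive cells whose ti_old is legitimately stale,
       * even though every particle has been drifted. */
      void engine_check_drifted(const struct engine_r *e) {
        if (e->just_repartitioned) return;
        for (int k = 0; k < e->nr_cells; ++k)
          if (e->cells[k].ti_old != e->ti_current) {
            fprintf(stderr, "Cell %d not drifted!\n", k);
            abort();
          }
      }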
