
Drift on demand

Merged Matthieu Schaller requested to merge drift_on_demand into master

Fix to the time integration bug that triggers #248 (closed).

As written in the minutes, finding all the cells that need to be drifted can be really difficult in some convoluted situations and I'd like to avoid a full tree walk.

My solution is as follows:

  • (Re-)create a drift task. When a cell has active particles, its drift task is active.
  • The runner_do_drift function gets replaced by a runner_do_unskip function that does not touch the particles but just updates the status of the tasks.
  • There is a runner_do_drift_all function that drifts everything for the cases where we need it (rebuild or snapshot).
  • If a pair task involves a cell that is inactive and has hence not been drifted, the pair task starts by calling the drift operation on this cell and then carries on with its regular work (see the sketch after this list). Most of the drifts are still done via the regular drift task, though.

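To make the last point concrete, here is a minimal sketch of the drift-on-demand pattern inside a pair task. It is illustrative only: the function name runner_dopair_example is made up, and the signatures of cell_is_drifted()/cell_drift() are simplified assumptions; only the structure matters.

    /* Illustrative sketch only: not the actual SWIFT pair kernel. */
    static void runner_dopair_example(struct runner *r, struct cell *ci,
                                      struct cell *cj) {

      const struct engine *e = r->e;

      /* A neighbouring cell may be inactive and hence not drifted by its own
       * (skipped) drift task, so bring it to the current time on demand. */
      if (!cell_is_drifted(ci, e)) cell_drift(ci, e);
      if (!cell_is_drifted(cj, e)) cell_drift(cj, e);

      /* ... then carry on with the regular pair interaction work ... */
    }
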
In terms of performance, it looks fairly similar to the current master, but it is now physically correct.

Activity

     * @param c The #cell.
     * @param e The #engine containing information about the current time.
     */
    -__attribute__((always_inline)) INLINE static void cell_is_drifted(
    +__attribute__((always_inline)) INLINE static int cell_is_drifted(
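
For reference, the kind of check this signature change enables looks roughly like the following. This is a sketch only: the comparison of c->ti_old against e->ti_current is an assumption based on the ti_old updates discussed later in this thread, not the actual implementation.

    /* Sketch only: report whether the cell has been drifted to the current
     * time. The field names (ti_old, ti_current) are assumptions. */
    __attribute__((always_inline)) INLINE static int cell_is_drifted(
        const struct cell *c, const struct engine *e) {

      return (c->ti_old == e->ti_current);
    }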
     *
     * @param e The #engine.
     */
    -void engine_drift(struct engine *e) {
    +void engine_unskip(struct engine *e) {

       const ticks tic = getticks();
       threadpool_map(&e->threadpool, runner_do_unskip_mapper, e->s->cells_top,
                      e->s->nr_cells, sizeof(struct cell), 1, e);

       if (e->verbose)
         message("took %.3f %s.", clocks_from_ticks(getticks() - tic),
                 clocks_getunit());
     }

     void engine_drift_all(struct engine *e) {
  • Added 1 commit:

    • c4af3c0d - Corrected documentation and removed MPI exception.
  • Just applied the changes you suggested above.

  • Thanks, but we have an MPI bug to find. This is easy to see: just run the SodShock, for instance:

    mpirun -np 2 ../swift_mpi -s -t 2 sodShock.yml

    The sanitizer reports:

    ==16183==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000000c (pc 0x00000052d942 bp 0x2b1e184c4c80 sp 0x2b1e184c4c00 T3)
        #0 0x52d941 in drift_part /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/drift.h:70
        #1 0x52d941 in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1049
        #2 0x52cfda in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1016

    Checking in a debugger shows that the xp pointer is NULL, so it is probably some initialisation issue.

  • Added 1 commit:

  • Thanks. Looks like I missed a task dependency. Annoyingly it does not get triggered every time...

  • We don't send the xparts over MPI, so we need to drift the cell on the sending side before the send. As the drift-->send dependency was missing in some cases, the drift would not have occurred before the send. The missing ordering is sketched below.
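
    A hedged sketch of the missing ordering, assuming it is expressed with the scheduler_addunlock() pattern used when the send tasks are built; the variable names (ci->drift, t_xv) are illustrative stand-ins, not the actual diff:

        /* Sketch only: make the cell's drift task run before its send_xv
         * task. 'ci->drift' and 't_xv' stand in for the real task pointers. */
        if (ci->drift != NULL && t_xv != NULL)
          scheduler_addunlock(s, ci->drift, t_xv);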

  • Did you have a fix for this? I haven't seen a push.

  • Almost. :smile: I think I may have uncovered another issue.

  • Matthieu Schaller Added 2 commits:

    • ab72a00c - Add drift-->send dependency and update ti_old on reception
    • f89460ea - More debugging info
  • Matthieu Schaller Added 2 commits:

  • Added 1 commit:

    • 048cc0af - Remove all debugging information
  • OK, now I am happy with it. Sorry, the git log is rather dirty as a consequence of using multiple machines to debug this.

  • Matthieu Schaller Added 23 commits:

  • I still seem to have an issue. When I run with a number of MPI ranks on a single node, I get:

    [0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
    [0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
    [0001] [00147.3] cell.c:cell_unskip_tasks():907: Missing link to send_xv task.

    That is for the command:

    mpirun -np 2 ../swift_mpi -a -t 1 -v 1 -s eagle_12.yml

    Running with the sanitizer on:

    ASAN:SIGSEGV
    =================================================================
    ==2898== ERROR: AddressSanitizer: SEGV on unknown address 0x00000000003a (pc 0x000000531b08 sp 0x7ffc44236770 bp 0x7ffc44236800 T0)
    AddressSanitizer can not provide additional info.
        #0 0x531b07 in scheduler_activate /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116
        #1 0x531b07 in cell_unskip_tasks /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/cell.c:898
    ASAN:SIGSEGV
    =================================================================
    ==2899== ERROR: AddressSanitizer: SEGV on unknown address 0x00000000003a (pc 0x000000531b08 sp 0x7ffd1ff64eb0 bp 0x7ffd1ff64f40 T0)
    AddressSanitizer can not provide additional info.
        #2 0x51ac1b in runner_do_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:756
        #3 0x51ac1b in runner_do_unskip_mapper /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:786
        #4 0x5514ab in engine_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2712
        #5 0x5514ab in engine_prepare /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2265
        #6 0x55908e in engine_step /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2674
        #7 0x404f7f in main /cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/main.c:561
        #0 0x531b07 in scheduler_activate /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116
        #1 0x531b07 in cell_unskip_tasks /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/cell.c:898
        #8 0x3df0c1ed1c in __libc_start_main (/lib64/libc.so.6+0x3df0c1ed1c)
        #9 0x403858 in _start (/cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/swift_mpi+0x403858)
    SUMMARY: AddressSanitizer: SEGV /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116 scheduler_activate
    ==2898== ABORTING
        #2 0x51ac1b in runner_do_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:756
        #3 0x51ac1b in runner_do_unskip_mapper /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:786
        #4 0x5514ab in engine_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2712
        #5 0x5514ab in engine_prepare /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2265
        #6 0x55908e in engine_step /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2674
        #7 0x404f7f in main /cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/main.c:561
        #8 0x3df0c1ed1c in __libc_start_main (/lib64/libc.so.6+0x3df0c1ed1c)
        #9 0x403858 in _start (/cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/swift_mpi+0x403858)
    SUMMARY: AddressSanitizer: SEGV /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116 scheduler_activate
    ==2899== ABORTING

    So that is a crash in scheduler_activate(), called from cell_unskip_tasks(), on both ranks.

    This is after step 2, so I suspect repartitioning may be involved.

  • Peter W. Draper Status changed to closed

  • Peter W. Draper Status changed to reopened

  • Thanks. That's very confusing, as I haven't touched the construction of the send_xv tasks. It also works when using 4 ranks... OK, back to the drawing board.

  • I can't reproduce the crash. Running with

    [0000] [3150700.1] main: MPI is up and running with 2 node(s).
     Welcome to the cosmological hydrodynamical code
        ______       _________________
       / ___/ |     / /  _/ ___/_  __/
       \__ \| | /| / // // /_   / /   
      ___/ /| |/ |/ // // __/  / /    
     /____/ |__/|__/___/_/    /_/     
     SPH With Inter-dependent Fine-grained Tasking
    
     Version : 0.4.0
     Revision: v0.4.0-610-g4e460da8, Branch: drift_on_demand
     Webpage : www.swiftsim.com
    
     Config. options: '--disable-vec --enable-debug --enable-sanitizer'
    
     Compiler: GCC, Version: 4.8.1
     CFLAGS  : '-g -O0  -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7 -msse4.2 -fno-tree-vectorize -fsanitize=address -fno-omit-frame-pointer  -Wall -Wextra -Wno-unused-parameter -Werror'
    
     HDF5 library version: 1.8.9
     MPI library: Intel(R) MPI Library 5.1.2 for Linux* OS (MPI std v3.0)

    the job finishes smoothly. This is running on cosma-4 with

    mpirun -np 2 ../swift_mpi -s -t 1 -a eagle_12.yml -n 8193
    Edited by Matthieu Schaller
  • Try that using --with-metis, otherwise repartitioning will be suppressed.

  • Peter W. Draper Added 1 commit:

    • de558d53 - Move unskip to after rebuild, probably needed this way after repartitioning
  • So if I move the engine_unskip() call in engine_prepare() to after the rebuild section, that seems to fix this problem (which makes sense to me: you need to rebuild after repartitioning before doing anything else), at least for the EAGLE_12 volume. However, if I run the SodShock volume the same way, I get:

    [0001] [00009.8] runner_doiact.h:runner_dopair2_force():920: Cell cj not drifted

    after step 10. Pushed this as it seems to be of some use; the reordering is sketched below.
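
    Roughly, the ordering tried in de558d53 looks like this (a sketch only; the flag names forcerepart/forcerebuild and the exact call sites in engine_prepare() are assumptions, not the actual diff):

        /* Sketch of the attempted ordering only; names are illustrative. */
        if (e->forcerepart) engine_repartition(e);
        if (e->forcerebuild) engine_rebuild(e);
        engine_unskip(e); /* moved to after the rebuild section */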

  • Peter W. Draper Added 1 commit:

    • 6d35c4e9 - Revert "Move unskip to after rebuild, probably needed this way after repartitioning"
  • And reverted, as it breaks the non-MPI SodShock test, which now outputs a time step of 0 on step 10 (the rebuild). Hmm.

    Edited by Peter W. Draper
  • I am getting more and more confused by this. I am starting to wonder how master works...

    Agreed, we don't need to unskip if we just had a repartition. But in other cases we must have it where it is.

  • Peter W. Draper Added 1 commit:

    • 598ae8c0 - Need to unskip before rebuild, except when following a repartition, so make that work
  • So if we accept that this is the way it must be, this patch is passing all my tests so far. The resulting ordering is sketched below.
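
    A hedged sketch of the resulting shape of engine_prepare() (the flag names forcerepart/forcerebuild and the exact calls are assumptions; error handling and MPI details are omitted):

        /* Sketch only: unskip before the rebuild, except when a repartition
         * is being done, in which case the rebuild that follows it makes the
         * unskip unnecessary. */
        if (!e->forcerepart) engine_unskip(e);

        if (e->forcerepart) engine_repartition(e);
        if (e->forcerebuild) engine_rebuild(e);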

  • Up to you.

    In any case, I am planning on starting a serious campaign to strengthen the MPI version when I come back in January. I am not yet confident that we do the right thing in that case.

  • Good. Seems to be working for all the tests I usually try, so I will accept this for now.

  • Peter W. Draper mentioned in commit 8cd4d555

  • Peter W. Draper Status changed to merged

  • mentioned in issue #248 (closed)

  • mentioned in issue #241 (closed)

  • Matthieu Schaller mentioned in merge request !294 (merged)
