Drift on demand
Fix for the time-integration bug that triggers #248 (closed).
As noted in the minutes, finding all the cells that need to be drifted can be really difficult in some convoluted situations, and I'd like to avoid a full tree walk.
My solution is as follows:
- (Re-)create a drift task. When a cell has active particles, its drift task is active.
- The `runner_do_drift` function gets replaced by a `runner_do_unskip` function that does not touch the particles but just updates the status of the tasks.
- There is a `runner_do_drift_all` function that drifts everything for the cases where we need this (rebuild or snapshot).
- If a pair task involves a cell that is inactive and has hence not been drifted, the pair task starts by calling the drift operation on this cell and then carries on with its regular work (see the sketch after this list). Most of the drifts are done via the regular task, though.
In terms of performance it looks fairly similar to the current master, but it is now physically correct.
mentioned in commit 1a500878
Reassigned to @pdraper
Added 33 commits:
- 6676d261...2ab1e4d0 - 32 commits from branch `master`
- bdfd6679 - Merge branch 'master' into drift_on_demand
```diff
  * @param c The #cell.
  * @param e The #engine containing information about the current time.
  */
-__attribute__((always_inline)) INLINE static void cell_is_drifted(
+__attribute__((always_inline)) INLINE static int cell_is_drifted(
```

```diff
  *
  * @param e The #engine.
  */
-void engine_drift(struct engine *e) {
+void engine_unskip(struct engine *e) {
+
+  const ticks tic = getticks();
+  threadpool_map(&e->threadpool, runner_do_unskip_mapper, e->s->cells_top,
+                 e->s->nr_cells, sizeof(struct cell), 1, e);
+
+  if (e->verbose)
+    message("took %.3f %s.", clocks_from_ticks(getticks() - tic),
+            clocks_getunit());
+}
+
+void engine_drift_all(struct engine *e) {
```

Added 1 commit:
- c4af3c0d - Corrected documentation and removed MPI exception.
Thanks, but we have an MPI bug to find. This is easy to see; just run the `SodShock`, for instance:

```
mpirun -np 2 ../swift_mpi -s -t 2 sodShock.yml
```
The sanitizer reports:
```
==16183==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000000c (pc 0x00000052d942 bp 0x2b1e184c4c80 sp 0x2b1e184c4c00 T3)
    #0 0x52d941 in drift_part /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/drift.h:70
    #1 0x52d941 in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1049
    #2 0x52cfda in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1016
```
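The trace dies inside `drift_part()`. A minimal sketch of why a bad extended-particle pointer would fault there; the field names (`v_full`, `x`) are assumptions for illustration, not the actual `drift.h` code:

```c
/* Minimal stand-ins (assumed fields, for illustration only). */
struct xpart { float v_full[3]; };
struct part { double x[3]; };

/* The first read through a NULL xp faults at a small offset from
   address zero, consistent with the 0x0c in the report above. */
static void drift_part(struct part *p, const struct xpart *xp, double dt) {
  p->x[0] += xp->v_full[0] * dt; /* SEGVs here if xp == NULL */
  p->x[1] += xp->v_full[1] * dt;
  p->x[2] += xp->v_full[2] * dt;
}
```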
Checking in a debugger shows that the `xp` pointer is NULL, so probably some initialisation issue.

Added 1 commit:
- 24555216 - Correct doxygen keywords
Added 1 commit:
- e9d7520b - Code formatting
Added 1 commit:
- 048cc0af - Remove all debugging information
Added 23 commits:
- 048cc0af...b362c1a7 - 22 commits from branch `master`
- ed889c1d - Merge branch 'master' into drift_on_demand
Still seem to have an issue. When I run with a number of MPI ranks on a single node I get:
```
[0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
[0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[0001] [00147.3] cell.c:cell_unskip_tasks():907: Missing link to send_xv task.
```
That is for the command:
```
mpirun -np 2 ../swift_mpi -a -t 1 -v 1 -s eagle_12.yml
```
Running with the sanitizer on:
```
ASAN:SIGSEGV
=================================================================
==2898== ERROR: AddressSanitizer: SEGV on unknown address 0x00000000003a (pc 0x000000531b08 sp 0x7ffc44236770 bp 0x7ffc44236800 T0)
AddressSanitizer can not provide additional info.
    #0 0x531b07 in scheduler_activate /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116
    #1 0x531b07 in cell_unskip_tasks /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/cell.c:898
    #2 0x51ac1b in runner_do_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:756
    #3 0x51ac1b in runner_do_unskip_mapper /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:786
    #4 0x5514ab in engine_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2712
    #5 0x5514ab in engine_prepare /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2265
    #6 0x55908e in engine_step /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2674
    #7 0x404f7f in main /cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/main.c:561
    #8 0x3df0c1ed1c in __libc_start_main (/lib64/libc.so.6+0x3df0c1ed1c)
    #9 0x403858 in _start (/cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/swift_mpi+0x403858)
SUMMARY: AddressSanitizer: SEGV /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116 scheduler_activate
==2898== ABORTING
```

(Rank ==2899== reports an identical, interleaved trace.)
So that's:
- https://gitlab.cosma.dur.ac.uk/swift/swiftsim/blob/drift_on_demand/src/cell.c#L898
- https://gitlab.cosma.dur.ac.uk/swift/swiftsim/blob/drift_on_demand/src/scheduler.h#L116
This is after step 2, so I suspect repartitioning may be involved.
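For reference, a hedged guess at the failing pattern around `cell.c:898`, reconstructed from the trace and the error message rather than from the actual source: the search of the cell's linked list of `send_xv` tasks comes up empty after a repartition, and the resulting NULL is then dereferenced.

```c
#include <stddef.h>

/* Minimal stand-ins (assumed shapes, for illustration only). */
struct cell;
struct task { struct cell *ci, *cj; };
struct link { struct task *t; struct link *next; };
struct cell { int nodeID; struct link *send_xv; };

/* Walk the cell's list of send_xv tasks looking for the one targeting
   a given node. After a repartition the matching link can be gone, so
   l ends up NULL: a bare l->t dereference faults at a small offset
   from address zero (the 0x3a above), while a guard instead produces
   the "Missing link to send_xv task" error seen in the other run. */
static struct task *find_send_xv(struct cell *c, int nodeID) {
  struct link *l = c->send_xv;
  while (l != NULL && l->t->cj->nodeID != nodeID) l = l->next;
  return l ? l->t : NULL; /* caller must check for NULL */
}
```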
I can't reproduce the crash. Running with

```
[0000] [3150700.1] main: MPI is up and running with 2 node(s).

 Welcome to the cosmological hydrodynamical code
    ______       _________________
   / ___/ |     / /  _/ ___/_  __/
   \__ \| | /| / // // /_   / /
  ___/ /| |/ |/ // // __/  / /
 /____/ |__/|__/___/_/    /_/
 SPH With Inter-dependent Fine-grained Tasking

 Version : 0.4.0
 Revision: v0.4.0-610-g4e460da8, Branch: drift_on_demand
 Webpage : www.swiftsim.com

 Config. options: '--disable-vec --enable-debug --enable-sanitizer'

 Compiler: GCC, Version: 4.8.1
 CFLAGS  : '-g -O0 -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7 -msse4.2 -fno-tree-vectorize -fsanitize=address -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Werror'

 HDF5 library version: 1.8.9
 MPI library: Intel(R) MPI Library 5.1.2 for Linux* OS (MPI std v3.0)
```
the job finishes smoothly. This is running on cosma-4 with

```
mpirun -np 2 ../swift_mpi -s -t 1 -a eagle_12.yml -n 8193
```
Edited by Matthieu Schaller

Added 1 commit:
- de558d53 - Move unskip to after rebuild, probably needed this way after repartitioning
So if I move the `engine_unskip()` in `engine_prepare()` to after the rebuild section, that seems to fix this problem (makes sense to me: you need to rebuild after repartitioning before doing anything else), at least for the `EAGLE_12` volume. However, if I run using the `SodShock` volume (same way), I get:

```
[0001] [00009.8] runner_doiact.h:runner_dopair2_force():920: Cell cj not drifted
```

after step 10. Pushed this as it seems to be of some use...
Added 1 commit:
- 6d35c4e9 - Revert "Move unskip to after rebuild, probably needed this way after repartitioning"
And reverted, as it breaks the non-MPI `SodShock` test, which now outputs a time-step of 0 on step 10 (the rebuild). Hmm.

Edited by Peter W. Draper

Added 1 commit:
- 598ae8c0 - Need to unskip before rebuild, except when following a repartition, so make that work
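A minimal sketch of the ordering that commit message describes, with assumed flag and function names rather than the actual `engine_prepare()` structure:

```c
/* Minimal stand-ins so the sketch is self-contained (all assumed). */
struct engine { int forcerepart, forcerebuild; };
static void engine_unskip(struct engine *e) { (void)e; /* ... */ }
static void engine_repartition(struct engine *e) { (void)e; /* ... */ }
static void engine_rebuild(struct engine *e) { (void)e; /* ... */ }

/* Unskip before the rebuild in the normal case, but right after a
   repartition the task links (e.g. send_xv) are stale, so skip the
   early unskip and rebuild the task graph first. */
static void engine_prepare(struct engine *e) {
  if (!e->forcerepart) engine_unskip(e); /* normal path: unskip first */
  if (e->forcerepart)
    engine_repartition(e); /* moves cells between ranks; rebuilds */
  else if (e->forcerebuild)
    engine_rebuild(e);
}
```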
mentioned in commit 8cd4d555
mentioned in issue #248 (closed)
mentioned in issue #241 (closed)
mentioned in merge request !294 (merged)