
Drift on demand

Merged Matthieu Schaller requested to merge drift_on_demand into master

Fix to the time integration bug that triggers #248 (closed).

As written in the minutes, finding all the cells that need to be drifted can be really difficult in some convoluted situations and I'd like to avoid a full tree walk.

My solution is as follows:

  • (Re-)create a drift task. When a cell has active particles, its drift task is active.
  • The runner_do_drift function gets replaced by a runner_do_unskip function that does not touch the particles but just updates the status of the tasks.
  • There is a runner_do_drift_all function that drifts everything for the cases where we need it (rebuild or snapshot).
  • If a pair task involves a cell that is inactive and has hence not been drifted, the pair task starts by calling the drift operation on this cell and then carries on with its regular work (see the sketch after this list). Most of the drifts are still done via the regular drift task, though.

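To make the last point concrete, here is a minimal sketch of the drift-on-demand pattern inside a pair task. It is illustrative only: the function name runner_dopair_example is made up, and the signatures of cell_is_drifted()/cell_drift() are simplified assumptions; only the structure matters.

    /* Illustrative sketch only: not the actual SWIFT pair kernel. */
    static void runner_dopair_example(struct runner *r, struct cell *ci,
                                      struct cell *cj) {

      const struct engine *e = r->e;

      /* A neighbouring cell may be inactive and hence not drifted by its own
       * (skipped) drift task, so bring it to the current time on demand. */
      if (!cell_is_drifted(ci, e)) cell_drift(ci, e);
      if (!cell_is_drifted(cj, e)) cell_drift(cj, e);

      /* ... then carry on with the regular pair interaction work ... */
    }
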
In terms of performance, it looks fairly similar to the current master, but it is now physically correct.

Activity

     * @param c The #cell.
     * @param e The #engine containing information about the current time.
     */
    -__attribute__((always_inline)) INLINE static void cell_is_drifted(
    +__attribute__((always_inline)) INLINE static int cell_is_drifted(
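
For reference, the kind of check this signature change enables looks roughly like the following. This is a sketch only: the comparison of c->ti_old against e->ti_current is an assumption based on the ti_old updates discussed later in this thread, not the actual implementation.

    /* Sketch only: report whether the cell has been drifted to the current
     * time. The field names (ti_old, ti_current) are assumptions. */
    __attribute__((always_inline)) INLINE static int cell_is_drifted(
        const struct cell *c, const struct engine *e) {

      return (c->ti_old == e->ti_current);
    }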
     *
     * @param e The #engine.
     */
    -void engine_drift(struct engine *e) {
    +void engine_unskip(struct engine *e) {

       const ticks tic = getticks();
       threadpool_map(&e->threadpool, runner_do_unskip_mapper, e->s->cells_top,
                      e->s->nr_cells, sizeof(struct cell), 1, e);

       if (e->verbose)
         message("took %.3f %s.", clocks_from_ticks(getticks() - tic),
                 clocks_getunit());
     }

     void engine_drift_all(struct engine *e) {
  • Added 1 commit:

    • c4af3c0d - Corrected documentation and removed MPI exception.
  • Just applied the changes you suggested above.

  • Thanks, but we have an MPI bug to find. This is easy to see: just run the SodShock, for instance:

    mpirun -np 2 ../swift_mpi -s -t 2 sodShock.yml

    The sanitizer reports:

    ==16183==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000000c (pc 0x00000052d942 bp 0x2b1e184c4c80 sp 0x2b1e184c4c00 T3)
        #0 0x52d941 in drift_part /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/drift.h:70
        #1 0x52d941 in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1049
        #2 0x52cfda in cell_drift /loc/pwda/pdraper/scratch/swift-tests/swiftsim-drift/src/cell.c:1016

    Checking in a debugger shows that the xp pointer is NULL, so it is probably some initialisation issue.

  • Added 1 commit:

  • Thanks. Looks like I missed a task dependency. Annoyingly it does not get triggered every time...

  • We don't send the xparts over MPI, so we need to drift the cell on the sending side before the send. As the drift-->send dependency was missing in some cases, the drift would not have occurred before the send. The missing ordering is sketched below.
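
    A hedged sketch of the missing ordering, assuming it is expressed with the scheduler_addunlock() pattern used when the send tasks are built; the variable names (ci->drift, t_xv) are illustrative stand-ins, not the actual diff:

        /* Sketch only: make the cell's drift task run before its send_xv
         * task. 'ci->drift' and 't_xv' stand in for the real task pointers. */
        if (ci->drift != NULL && t_xv != NULL)
          scheduler_addunlock(s, ci->drift, t_xv);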

  • Did you have a fix for this? I haven't seen a push.

  • Almost. :smile: I think I may have uncovered another issue.

  • Matthieu Schaller Added 2 commits:

    • ab72a00c - Add drift-->send dependency and update ti_old on reception
    • f89460ea - More debugging info
  • Matthieu Schaller Added 2 commits:

  • Added 1 commit:

    • 048cc0af - Remove all debugging information
  • OK, now I am happy with it. Sorry, the git log is rather dirty as a consequence of using multiple machines to debug this.

  • Matthieu Schaller Added 23 commits:

  • I still seem to have an issue. When I run with a number of MPI ranks on a single node, I get:

    [0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
    [0001] [00147.3] cell.c:cell_unskip_tasks():939: Missing link to send_xv task.
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
    [0001] [00147.3] cell.c:cell_unskip_tasks():907: Missing link to send_xv task.

    That is for the command:

    mpirun -np 2 ../swift_mpi -a -t 1 -v 1 -s eagle_12.yml

    Running with the sanitizer on:

    ASAN:SIGSEGV
    =================================================================
    ==2898== ERROR: AddressSanitizer: SEGV on unknown address 0x00000000003a (pc 0x000000531b08 sp 0x7ffc44236770 bp 0x7ffc44236800 T0)
    AddressSanitizer can not provide additional info.
        #0 0x531b07 in scheduler_activate /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116
        #1 0x531b07 in cell_unskip_tasks /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/cell.c:898
    ASAN:SIGSEGV
    =================================================================
    ==2899== ERROR: AddressSanitizer: SEGV on unknown address 0x00000000003a (pc 0x000000531b08 sp 0x7ffd1ff64eb0 bp 0x7ffd1ff64f40 T0)
    AddressSanitizer can not provide additional info.
        #2 0x51ac1b in runner_do_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:756
        #3 0x51ac1b in runner_do_unskip_mapper /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:786
        #4 0x5514ab in engine_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2712
        #5 0x5514ab in engine_prepare /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2265
        #6 0x55908e in engine_step /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2674
        #7 0x404f7f in main /cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/main.c:561
        #0 0x531b07 in scheduler_activate /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116
        #1 0x531b07 in cell_unskip_tasks /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/cell.c:898
        #8 0x3df0c1ed1c in __libc_start_main (/lib64/libc.so.6+0x3df0c1ed1c)
        #9 0x403858 in _start (/cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/swift_mpi+0x403858)
    SUMMARY: AddressSanitizer: SEGV /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116 scheduler_activate
    ==2898== ABORTING
        #2 0x51ac1b in runner_do_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:756
        #3 0x51ac1b in runner_do_unskip_mapper /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/runner.c:786
        #4 0x5514ab in engine_unskip /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2712
        #5 0x5514ab in engine_prepare /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2265
        #6 0x55908e in engine_step /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/engine.c:2674
        #7 0x404f7f in main /cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/main.c:561
        #8 0x3df0c1ed1c in __libc_start_main (/lib64/libc.so.6+0x3df0c1ed1c)
        #9 0x403858 in _start (/cosma6/data/dp004/pdraper/swift/swiftsim-drift/examples/swift_mpi+0x403858)
    SUMMARY: AddressSanitizer: SEGV /cosma6/data/dp004/pdraper/swift/swiftsim-drift/src/scheduler.h:116 scheduler_activate
    ==2899== ABORTING

    So that is a crash in scheduler_activate(), called from cell_unskip_tasks(), on both ranks.

    This is after step 2, so I suspect repartitioning may be involved.

  • Peter W. Draper Status changed to closed

  • Peter W. Draper Status changed to reopened

  • Thanks. That's very confusing, as I haven't touched the construction of the send_xv tasks. It also works when using 4 ranks... OK, back to the drawing board.

  • I can't reproduce the crash. Running with

    [0000] [3150700.1] main: MPI is up and running with 2 node(s).
     Welcome to the cosmological hydrodynamical code
        ______       _________________
       / ___/ |     / /  _/ ___/_  __/
       \__ \| | /| / // // /_   / /   
      ___/ /| |/ |/ // // __/  / /    
     /____/ |__/|__/___/_/    /_/     
     SPH With Inter-dependent Fine-grained Tasking
    
     Version : 0.4.0
     Revision: v0.4.0-610-g4e460da8, Branch: drift_on_demand
     Webpage : www.swiftsim.com
    
     Config. options: '--disable-vec --enable-debug --enable-sanitizer'
    
     Compiler: GCC, Version: 4.8.1
     CFLAGS  : '-g -O0  -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7 -msse4.2 -fno-tree-vectorize -fsanitize=address -fno-omit-frame-pointer  -Wall -Wextra -Wno-unused-parameter -Werror'
    
     HDF5 library version: 1.8.9
     MPI library: Intel(R) MPI Library 5.1.2 for Linux* OS (MPI std v3.0)

    the job finishes smoothly. This is running on cosma-4 with

    mpirun -np 2 ../swift_mpi -s -t 1 -a eagle_12.yml -n 8193
    Edited by Matthieu Schaller
  • Try that using --with-metis, otherwise repartitioning will be suppressed.

  • Peter W. Draper Added 1 commit:

    • de558d53 - Move unskip to after rebuild, probably needed this way after repartitioning
  • So if I move the engine_unskip() call in engine_prepare() to after the rebuild section, that seems to fix this problem (which makes sense to me: you need to rebuild after repartitioning before doing anything else), at least for the EAGLE_12 volume. However, if I run the SodShock volume the same way, I get:

    [0001] [00009.8] runner_doiact.h:runner_dopair2_force():920: Cell cj not drifted

    after step 10. Pushed this as it seems to be of some use; the reordering is sketched below.
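
    Roughly, the ordering tried in de558d53 looks like this (a sketch only; the flag names forcerepart/forcerebuild and the exact call sites in engine_prepare() are assumptions, not the actual diff):

        /* Sketch of the attempted ordering only; names are illustrative. */
        if (e->forcerepart) engine_repartition(e);
        if (e->forcerebuild) engine_rebuild(e);
        engine_unskip(e); /* moved to after the rebuild section */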

  • Peter W. Draper Added 1 commit:

    • 6d35c4e9 - Revert "Move unskip to after rebuild, probably needed this way after repartitioning"
  • And reverted, as it breaks the non-MPI SodShock test, which now outputs a time step of 0 on step 10 (the rebuild). Hmm.

    Edited by Peter W. Draper
  • I am getting more and more confused by this. I am starting to wonder how master works...

    Agreed, we don't need to unskip if we just had a repartition. But in other cases we must have it where it is.

  • Peter W. Draper Added 1 commit:

    • 598ae8c0 - Need to unskip before rebuild, except when following a repartition, so make that work
  • So if we accept that this is the way it must be, this patch is passing all my tests so far. The resulting ordering is sketched below.
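
    A hedged sketch of the resulting shape of engine_prepare() (the flag names forcerepart/forcerebuild and the exact calls are assumptions; error handling and MPI details are omitted):

        /* Sketch only: unskip before the rebuild, except when a repartition
         * is being done, in which case the rebuild that follows it makes the
         * unskip unnecessary. */
        if (!e->forcerepart) engine_unskip(e);

        if (e->forcerepart) engine_repartition(e);
        if (e->forcerebuild) engine_rebuild(e);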

  • Up to you.

    In any case, I am planning on starting a serious campaign to strengthen the MPI version when I come back in January. I am not yet confident that we do the right thing in that case.

  • Good. Seems to be working for all the tests I usually try, so I will accept this for now.

  • Peter W. Draper mentioned in commit 8cd4d555

  • Peter W. Draper Status changed to merged

  • mentioned in issue #248 (closed)

  • mentioned in issue #241 (closed)

  • Matthieu Schaller mentioned in merge request !294 (merged)
