Do not recurse to lower cells within the sub-tasks if they don't contain any particles.
This should fix #274 (closed). @jwillis could you please check that your setup that crashes works with this version?
It builds on your proposed fix but I moved the decision of aborting higher up in the recursion.
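For reference, the shape of the change is roughly this (a minimal sketch, not the actual diff: `dosub1_self` is a hypothetical stand-in for the self-interaction branch of the real `DOSUB1` recursion, and only the usual `struct cell` fields — `count`, `split`, `progeny` — are assumed):

```c
/* Minimal sketch of the idea (not the actual patch): abort the
 * sub-task recursion at the top as soon as a cell holds no
 * particles, instead of descending all the way to the leaves. */
static void dosub1_self(struct runner *r, struct cell *c) {

  /* No particles here? Then there is nothing to interact and no
   * reason to look at the progeny at all. */
  if (c->count == 0) return;

  if (c->split) {
    /* Recurse; empty children are caught by the same check above. */
    for (int k = 0; k < 8; k++)
      if (c->progeny[k] != NULL) dosub1_self(r, c->progeny[k]);
    /* ... plus the sub-pair interactions between the progeny ... */
  } else {
    /* Leaf cell: run the actual particle loop. */
    DOSELF1(r, c);
  }
}
```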
I merged your branch into mine, which also got me the latest `master`. Now the code fails a memory allocation when the sanitizer is switched on. I get the following error:
```
==3291== WARNING: AddressSanitizer failed to allocate 0x000214799833 bytes
```

on line 646 of `engine.c` in `engine_redistribute`. While running:

```
../swift_mpi -s -g -a -e -t 16 eagle_25.yml -n 4096
```
Yeah, the `master` crashes as well. With:

```
==26231== WARNING: AddressSanitizer failed to allocate 0x000214799833 bytes
=================================================================
==26231== ERROR: AddressSanitizer: unknown-crash on address 0x000000000000 at pc 0x2b6d2a5a054d bp 0x7ffefb67b460 sp 0x7ffefb67ac20
WRITE of size 1053315968 at 0x000000000000 thread T0
    #0 0x2b6d2a5a054c in __interceptor_memcpy.part.13 /cosma/local/software/gcc/build/4.8.1/build/x86_64-unknown-linux-gnu/libsanitizer/asan/../../../../gcc-4.8.1/libsanitizer/asan/asan_interceptors.cc:288
    #1 0x44bed8 in engine_redistribute /cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/src/engine.c:636
    #2 0x406da0 in main /cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/examples/main.c:589
    #3 0x2b6d2e744d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)
    #4 0x403c48 in _start (/cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/examples/swift_mpi+0x403c48)
SUMMARY: AddressSanitizer: unknown-crash /cosma/local/software/gcc/build/4.8.1/build/x86_64-unknown-linux-gnu/libsanitizer/asan/../../../../gcc-4.8.1/libsanitizer/asan/asan_interceptors.cc:288 __interceptor_memcpy.part.13
Shadow bytes around the buggy address:
  0x00007fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x00007fff8000:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff8010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff8020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x00007fff8050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Heap righ redzone:     fb
  Freed Heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack partial redzone: f4
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  ASan internal:         fe
==26231== ABORTING
```
I configured it with:
./configure CC="gcc" --disable-doxygen-doc --enable-debug --enable-sanitizer --disable-optimization --enable-debugging-checks --with-metis
I get a divide by zero in `runner_iact_nonsym_force` because a particle has a smoothing length of zero. I think this is related to my issue: #274 (closed), even though gravity is turned off.

Just trying that now. One thing though: you know how the check you put in for the particle counts is in `DOSUB1` + `DOSUB2`? The tasks themselves shouldn't be created in the first place, right? So can the check not go in `engine_makehydroloop_tasks`, as we have the cell particle counts from `exchange_cells` already (a rough sketch of what I mean is below)? Or is it a problem with how we push the tasks down from the top-level cells to all the progenies?

The problem I have now is that, after your changes in the `subtasks_only_for_cells_with_parts` branch, I can't reproduce my old bug in my branch.
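To make the `engine_makehydroloop_tasks` suggestion above concrete, this is the kind of check I have in mind (a rough sketch only, not the real task-construction code; `sched`, `cells` and `nr_cells` are the usual engine fields, while `add_self_task()`, `add_pair_task()` and `neighbour()` are hypothetical stand-ins for the actual scheduler calls and neighbour lookup):

```c
/* Rough sketch, not the actual task-construction code: consult the
 * per-cell particle counts (known once exchange_cells() has run)
 * before any hydro task is created, so empty cells never get tasks
 * in the first place. */
for (int cid = 0; cid < nr_cells; cid++) {

  struct cell *ci = &cells[cid];

  /* Empty top-level cell: no self task and no pair tasks at all. */
  if (ci->count == 0) continue;

  add_self_task(sched, ci);

  /* Pair tasks with the (up to 26) neighbouring top-level cells. */
  for (int k = 0; k < 26; k++) {
    struct cell *cj = neighbour(ci, k); /* hypothetical neighbour lookup */

    /* Skip missing or empty neighbours as well. */
    if (cj == NULL || cj->count == 0) continue;

    add_pair_task(sched, ci, cj);
  }
}
```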
Well that was the point of my fixes.
Edited by Matthieu Schaller

Haha well, the fix I put in was at the `DOPAIR1` level, not the `DOSUB1` level, so it didn't seem to fix the problem, and now when I undo your changes I still can't reproduce it.

Edited by James Willis

I've run the `master` again with the sanitizer turned off and it seems to run fine on one node with gravity off. I will try multiple nodes with and without gravity. I don't know what changed between when I started debugging that gravity problem and now, because I could run the sanitizer fine with MPI and did not run into the same issue.

```
Config. options: 'CC=gcc --disable-doxygen-doc --enable-debug --enable-debugging-checks --with-metis'
Compiler: GCC, Version: 4.8.1
CFLAGS : '-g -O0 -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7-avx -mavx -Wall -Wextra -Wno-unused-parameter -Werror'
HDF5 library version: 1.8.9
MPI library: Intel(R) MPI Library 5.1.3 for Linux* OS (MPI std v3.0)
METIS library version: 5.1.0
```

```
mpirun -np 2 ../swift_mpi -s -a -e -t 16 eagle_25.yml -n 4096
```

Edited by James Willis

mentioned in issue #283 (closed)
Tripped over (what looks like) this bug myself, so I've had a go at tracking where it originates. After a lot of disappointing bisections, I realised that our messy history doesn't really work well with that technique, so I tried a targeted revert instead, and it seems to have worked first time. After reverting b8f5850f (gravity multi-dt) my test now runs.
Re: reverting b8f5850f, I realised I had missed out one configure option I'd used previously, namely the sanitizer, so I reconfigured with that and now get a new error about receiving undrifted particles in step 1. So a false dawn of some kind. BTW, I undid `--enable-sanitizer` and the job does run again. Time to look farther afield.

@jwillis but you crash even without the sanitizer, no?
What setup are you running @pdraper?
Setup:

```
swift/c5/gcc/intelmpi/5.1.3
./configure --with-metis --enable-debug --enable-debugging-checks
mpirun -np 4 ../swift_mpi -a -t 16 -s -n 5000 -v 2 eagle_25.yml
```

all on `master`. The initial problem I was seeing was a memory runaway as a recursive call just kept going...
Edited by Peter W. Draper

Because mine crashes using the latest `master` and at commit 0bed4834 before the gravity merge. I use your setup as well:

```
module load swift/c5/gcc/intelmpi/5.1.3
./configure --disable-doxygen-doc --enable-debug --enable-debugging-checks --with-metis
mpirun -np 4 ../swift_mpi -a -e -t 16 -s -n 4096 eagle_25.yml
```

apart from `-e -v 2`, but that shouldn't make a difference.

Edited by James Willis

Running your setup with gravity I get the following errors:
```
mlx4: local QP operation err (QPN 03bcd6, WQE index 1ae0000, vendor syndrome 70, opcode = 5e)
m5418:UCM:3a10:5383f800: 140466316 us(140466316 us!!!): DTO completion ERR: status 2, op Invalid DTO OP?, vendor_err 0x70 - 0.6.154.162
[0000] [00139.9] engine_redistribute: request 0 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 1 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 2 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 3 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 4 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 5 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 6 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 7 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 8 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 9 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 10 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 11 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 12 has error 'Undefined dynamic error code'.
[0000] [00139.9] engine_redistribute: request 13 has error 'No MPI error'.
[0000] [00139.9] engine_redistribute: request 14 has error 'Internal MPI error!, error stack: (unknown)(): hr'.
[0000] [00139.9] engine_redistribute: request 15 has error 'No MPI error'.
```
Is that what you get running with `-g`?

Edited by James Willis