
Do not recurse to lower cells within the sub-tasks if they don't contain any particles.

Merged Matthieu Schaller requested to merge subtasks_only_for_cells_with_parts into master

This should fix #274 (closed). @jwillis could you please check that your setup that crashes works with this version ?

It builds on your proposed fix but I moved the decision of aborting higher up in the recursion.


Activity

  • If you are happy with it, just re-assign it to Peter for a final cross-check.

  • I merged your branch into mine, which also got me the latest master. Now the code fails a memory allocation when the sanitizer is switched on.

    I get the following error:

    ==3291== WARNING: AddressSanitizer failed to allocate 0x000214799833 bytes

    on line 646 of engine.c in engine_redistribute.

    While running: ../swift_mpi -s -g -a -e -t 16 eagle_25.yml -n 4096

  • I don't think it's related to the changes you have made.

  • Does that mean that master crashes with the same issue ? Also, what happens if you switch off the sanitizer ? My recollection was that sanitizer + MPI was never a good combination.

  • Yeah, the master crashes as well. With:

    ==26231== WARNING: AddressSanitizer failed to allocate 0x000214799833 bytes
    =================================================================
    ==26231== ERROR: AddressSanitizer: unknown-crash on address 0x000000000000 at pc 0x2b6d2a5a054d bp 0x7ffefb67b460 sp 0x7ffefb67ac20
    WRITE of size 1053315968 at 0x000000000000 thread T0
        #0 0x2b6d2a5a054c in __interceptor_memcpy.part.13 /cosma/local/software/gcc/build/4.8.1/build/x86_64-unknown-linux-gnu/libsanitizer/asan/../../../../gcc-4.8.1/libsanitizer/asan/asan_interceptors.cc:288
        #1 0x44bed8 in engine_redistribute /cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/src/engine.c:636
        #2 0x406da0 in main /cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/examples/main.c:589
        #3 0x2b6d2e744d5c in __libc_start_main (/lib64/libc.so.6+0x1ed5c)
        #4 0x403c48 in _start (/cosma5/data/dp004/dc-will2/SWIFT/master/swiftsim/examples/swift_mpi+0x403c48)
    SUMMARY: AddressSanitizer: unknown-crash /cosma/local/software/gcc/build/4.8.1/build/x86_64-unknown-linux-gnu/libsanitizer/asan/../../../../gcc-4.8.1/libsanitizer/asan/asan_interceptors.cc:288 __interceptor_memcpy.part.13
    Shadow bytes around the buggy address:
      0x00007fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    =>0x00007fff8000:[00]00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff8010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff8020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x00007fff8050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
      Addressable:           00
      Partially addressable: 01 02 03 04 05 06 07  
      Heap left redzone:     fa  
      Heap righ redzone:     fb  
      Freed Heap region:     fd  
      Stack left redzone:    f1  
      Stack mid redzone:     f2  
      Stack right redzone:   f3  
      Stack partial redzone: f4
      Stack after return:    f5  
      Stack use after scope: f8
      Global redzone:        f9
      Global init order:     f6  
      Poisoned by user:      f7  
      ASan internal:         fe
    ==26231== ABORTING

    I configured it with:

    ./configure CC="gcc" --disable-doxygen-doc --enable-debug --enable-sanitizer --disable-optimization --enable-debugging-checks --with-metis
  • I get the same problem when compiling without the sanitizer.

    engine.c:engine_redistribute():716: Failed during waitall for part data.
  • Ok, thanks. That means I pushed something bad last week or over the weekend... I will sort it out.

  • One more question: does it also crash without '-g'?

  • ok, so I f***ed up badly...

  • I will run with the sanitizer and see if it crashes in the same place without -g

  • Perfect. Running it in gdb could also give you a line number.

  • For some reason, it doesn't give anything in the sanitizer and there's nothing in the .err file. I just get a bad termination error. I will try it in DDT.

  • I get a divide by zero in runner_iact_nonsym_force because a particle has a smoothing length of zero. I think this is related to my issue: #274 (closed) even though gravity is turned off

  • How about when using Platform MPI ?

  • Just trying that now. One thing though: you know the check you put in for the particle counts is in DOSUB1+DOSUB2. The tasks themselves shouldn't be created in the first place, right? So couldn't the check go in engine_makehydroloop_tasks, since we already have the cell particle counts from exchange_cells? Or is it a problem with how we push the tasks down from the top-level cells to all the progenies?

  • These tasks recurse so you can't check only at the level where the task is created.

    A task would not be created anyway if count is 0.
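The guard under discussion can be sketched as follows (a minimal illustration, not SWIFT's actual DOSUB1/DOSUB2 code; the cell layout and function names here are invented): the emptiness check sits at the top of the recursive function itself, so it protects every level of the descent, not only the level at which the task was created.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative cell: a particle count plus up to 8 progeny (NULL = leaf).
   This only loosely mirrors the shape of SWIFT's cell tree. */
struct cell {
  int count;               /* number of particles in this cell */
  struct cell *progeny[8]; /* sub-cells, NULL where absent */
};

/* Sketch of a recursive sub-task: the check must live here, at the top
   of each recursion step, not only where the task is first made. */
static int dosub_self(struct cell *c, int *visited) {
  if (c->count == 0) return 0; /* the fix: do not recurse into empty cells */
  (*visited)++;                /* stands in for doing real work on c */
  for (int k = 0; k < 8; k++)
    if (c->progeny[k] != NULL) dosub_self(c->progeny[k], visited);
  return 1;
}
```

With this structure, creating the task only for non-empty top-level cells is not enough on its own: a non-empty parent can still have empty progeny, and the guard above is what stops the descent there.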

  • Ah, I see: the DOSUB1 task is created and, when it runs, it recurses through its progenies.

  • The problem I have now is that, after your changes in subtasks_only_for_cells_with_parts, I can't reproduce my old bug in my branch.

  • With Platform MPI, I get:

    runner.c:runner_do_recv_part():1411: Received un-drifted particle !

    after the third timestep.

  • Ok... That's even weirder.

  • > The problem I have now is that, after your changes in subtasks_only_for_cells_with_parts, I can't reproduce my old bug in my branch.

    Well that was the point of my fixes. :smile:

    Edited by Matthieu Schaller
  • Haha, well, the fix I put in was at the DOPAIR1 level, not the DOSUB1 level, so it didn't seem to fix the problem, and now when I undo your changes I still can't reproduce it.

    Edited by James Willis
  • I've run the master again with the sanitizer turned off and it seems to run fine on one node with gravity off. I will try multiple nodes, with and without gravity. I don't know what changed between when I started debugging that gravity problem and now, because earlier I could run the sanitizer fine with MPI and did not hit this issue.

  • Okay, so I've gone to 2 nodes and I get the undrifted-particle error. The particle has an ID and position but nothing else, not even a smoothing length. Its ti_drift is set to 0, which is why the error occurs. So this is similar to my problem. I will look into the send tasks.
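The check that fires here has roughly this shape (a simplified sketch; the struct fields and function name are assumptions, not copied from SWIFT's runner_do_recv_part): a received particle whose ti_drift is still 0 can never match the current integer time, so the error triggers immediately on such a zeroed particle.

```c
#include <assert.h>

/* Illustrative particle: only the fields relevant to the check. */
struct part {
  long long id;       /* particle ID */
  double h;           /* smoothing length */
  long long ti_drift; /* integer time of the last drift */
};

/* Sketch of the drift-consistency check: a particle received over MPI
   must have been drifted to the current time before it is used. */
static int part_is_drifted(const struct part *p, long long ti_current) {
  return p->ti_drift == ti_current;
}
```

A particle that arrives with only its ID and position set (everything else zeroed) fails this check on the first step it is touched, which matches the symptom described above.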

  • What is the setup ? What configure flags ? What compiler ? What libraries ? What runtime parameters ?

  • Config. options: 'CC=gcc --disable-doxygen-doc --enable-debug --enable-debugging-checks --with-metis'
    
    Compiler: GCC, Version: 4.8.1
     CFLAGS  : '-g -O0  -gdwarf-2 -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7-avx -mavx  -Wall -Wextra -Wno-unused-parameter -Werror'
    
     HDF5 library version: 1.8.9
     MPI library: Intel(R) MPI Library 5.1.3 for Linux* OS (MPI std v3.0)
     METIS library version: 5.1.0
    
    mpirun -np 2 ../swift_mpi -s -a -e -t 16 eagle_25.yml -n 4096
    Edited by James Willis
  • mentioned in issue #283 (closed)

  • Tripped over (what looks like) this bug myself, so I've had a go at tracking down where it originates. After a lot of disappointing bisections, I realised that our messy history doesn't really work well with that technique, so I tried a targeted revert instead, and it seems to have worked first time. After reverting b8f5850f (gravity multi-dt) my test now runs.

  • Could it be related to the changes in the task_lock and task_unlock functions ? The rest of the changes pushed there should only affect the content of gravity tasks.

  • James, if you revert to this point, does your test also succeed ?

  • My job is still waiting to run for the version that Peter suggested, but when I revert to v0.5.0 it runs fine

  • Ok, thanks. That's progress.

  • Also, cosma-4 is half empty.

  • Have you got a cosma-4 submission script I could borrow?

  • Re: reverting b8f5850f. I realised I had missed out one configure option I'd used previously, namely the sanitizer, so I reconfigured with that and now I get a new error about receiving undrifted particles in step 1. So a false dawn of some kind. BTW, I undid --enable-sanitizer and the job does run again. Time to look farther afield.

  • Are we confident the sanitizer can be used in combination with MPI ?

  • @jwillis but you crash even without the sanitizer, no ?

  • I seem to recall that it got further without the sanitizer, but I can't remember which revision I was running at the time.

  • What setup are you running @pdraper?

  • Re: MPI and sanitizer. I've used it many times in the past without any issues, so, until now, I'd say yes.

  • Setup:

       swift/c5/gcc/intelmpi/5.1.3
       ./configure --with-metis --enable-debug --enable-debugging-checks
       mpirun -np 4 ../swift_mpi -a -t 16 -s -n 5000 -v 2 eagle_25.yml

    all on master. The initial problem I was seeing was a memory runaway as a recursive call just kept going...

    Edited by Peter W. Draper
  • Where was this runaway taking place ?

  • It was in cell_unpack. The values being used looked bogus (negative counts for instance), so this was just a side-effect of something else.
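One way to turn that kind of runaway into an immediate, diagnosable failure is to validate the unpacked counts before recursing (a hypothetical sketch, not SWIFT's real cell_unpack; the pcell layout here is invented): bogus values such as negative counts then abort the unpack instead of driving an unbounded recursion.

```c
#include <assert.h>

/* Illustrative packed-cell record; SWIFT's real pcell differs. */
struct pcell {
  int count;      /* particle count; negative here means corruption */
  int nr_progeny; /* number of packed sub-cells following this one */
};

/* Defensive recursive unpack over a flat buffer of n records.
   Returns the index just past this cell's subtree, or -1 on bad data. */
static int cell_unpack_checked(const struct pcell *pc, int n, int i) {
  if (i >= n) return -1; /* ran off the end of the buffer */
  if (pc[i].count < 0 || pc[i].nr_progeny < 0 || pc[i].nr_progeny > 8)
    return -1;           /* bogus values: abort instead of recursing */
  int next = i + 1;
  for (int k = 0; k < pc[i].nr_progeny; k++) {
    next = cell_unpack_checked(pc, n, next);
    if (next < 0) return -1;
  }
  return next;
}
```

The point of the sketch is only that corrupted input (here, the negative counts mentioned above) is caught at the first affected record rather than surfacing later as a memory runaway.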

  • And that problem also goes away if I stop using the sanitizer. Hmm.

  • Is that with the latest version of the code @pdraper? Or the commit before b8f5850f?

  • Because mine crashes using the latest master and at commit 0bed4834 before the gravity merge. I use your setup as well:

    module load swift/c5/gcc/intelmpi/5.1.3
    ./configure --disable-doxygen-doc --enable-debug --enable-debugging-checks --with-metis
    mpirun -np 4 ../swift_mpi -a -e -t 16 -s -n 4096 eagle_25.yml

    apart from the -e and -v 2 flags, but that shouldn't make a difference.

    Edited by James Willis
  • Ignore that last comment; I can't get v0.5.0 to run now. I must be doing something wrong in the submission script.

  • I'm using the HEAD of the current master and without the sanitizer it runs.

  • Running your setup with gravity I get the following errors:

    mlx4: local QP operation err (QPN 03bcd6, WQE index 1ae0000, vendor syndrome 70, opcode = 5e) 
    m5418:UCM:3a10:5383f800: 140466316 us(140466316 us!!!): DTO completion ERR: status 2, op Invalid DTO OP?, vendor_err 0x70 - 0.6.154.162
    [0000] [00139.9] engine_redistribute: request 0 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 1 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 2 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 3 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 4 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 5 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 6 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 7 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 8 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 9 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 10 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 11 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 12 has error 'Undefined dynamic error code'.
    [0000] [00139.9] engine_redistribute: request 13 has error 'No MPI error'.
    [0000] [00139.9] engine_redistribute: request 14 has error 'Internal MPI error!, error stack:
    (unknown)(): hr'.
    [0000] [00139.9] engine_redistribute: request 15 has error 'No MPI error'.

    Is that what you get running with -g?

    Edited by James Willis