
Gravity splittask (non-MPI)

Merged Matthieu Schaller requested to merge gravity_splittask into master

This implements task splitting for gravity. Changes are:

  • Split large gravity tasks (self and pair)
  • Replace gravity pair tasks between cells that are far enough apart with a multipole-multipole (M-M) task.
  • Re-order operations in a time-step.

The last item fixes a long-standing accuracy issue: we were always running one more time-step after detecting that a rebuild was needed.

The task-splitting is currently not enabled over MPI. I will make this work alongside the periodic gravity over MPI in the next merge request.
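
To make the splitting/replacement decision concrete, here is a minimal sketch of the kind of per-pair choice involved: pairs of cells that are far enough apart become a single multipole-multipole (M-M) interaction, large nearby pairs are split by recursing into the cell hierarchy, and everything else stays a direct pair task. All struct fields, function names and the theta_crit / max_count_per_task parameters below are illustrative assumptions, not the actual SWIFT API.

    #include <stddef.h>

    /* Illustrative cell: only the fields needed for the decision are shown. */
    struct cell {
      double loc[3];           /* centre of the cell */
      double width;            /* cell edge length */
      int count;               /* number of gravity particles in the cell */
      struct cell *progeny[8]; /* NULL entries if the cell is not refined */
    };

    /* Opening-angle style test: two cells are "far enough" apart if their
     * combined size is small compared to the distance between them. */
    static int cells_are_far_enough(const struct cell *ci, const struct cell *cj,
                                    double theta_crit) {
      double r2 = 0.;
      for (int k = 0; k < 3; k++) {
        const double dx = ci->loc[k] - cj->loc[k];
        r2 += dx * dx;
      }
      const double size = ci->width + cj->width; /* crude bound on the extents */
      return size * size < theta_crit * theta_crit * r2;
    }

    /* What to do with a gravity pair task:
     *  - far apart   -> replace it with one multipole-multipole (M-M) task,
     *  - large+close -> split it by recursing into the progeny,
     *  - otherwise   -> keep the direct pair task. */
    enum pair_action { PAIR_MM, PAIR_SPLIT, PAIR_KEEP };

    static enum pair_action classify_pair(const struct cell *ci,
                                          const struct cell *cj,
                                          double theta_crit,
                                          int max_count_per_task) {
      if (cells_are_far_enough(ci, cj, theta_crit)) return PAIR_MM;
      if (ci->count > max_count_per_task && cj->count > max_count_per_task &&
          ci->progeny[0] != NULL && cj->progeny[0] != NULL)
        return PAIR_SPLIT;
      return PAIR_KEEP;
    }

Splitting only when both cells are populated and refined keeps the task graph from growing without bound, which is also what the depth limit introduced later in this merge request is for.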

Activity

  • :unamused: A long-running EAGLE-25 job crashed with what looks like a random seg-fault....

    So, back to me...

  • added 1 commit

    • 05c71340 - Better naming convention for the function deciding whether a task can be split in the hydro case.

    Compare with previous version

  • added 1 commit

    • 5ad98d9d - Added functions to decide whether a gravity task associated to a given cell can be split.

    Compare with previous version

  • added 1 commit

    • 0e8d069d - Limit the depth of the gravity tasks being split.

    Compare with previous version

  • added 1 commit

    • 5e33f262 - Do not construct hierarchical gravity tasks below the level where there are any tasks.

    Compare with previous version

  • added 1 commit

    • 42f89f51 - Allow the user to tweak the maximal depth of gravity tasks.

    Compare with previous version
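
    The commits above boil down to a depth-limited predicate for whether a gravity task attached to a cell may be split. A minimal sketch of such a predicate, assuming illustrative field and parameter names (min_count, max_split_depth) rather than the actual SWIFT ones:

    #include <stddef.h>

    /* Illustrative cell with only the fields needed for the decision. */
    struct cell {
      int count;               /* number of gravity particles in the cell */
      int depth;               /* depth in the tree (0 = top-level cell) */
      struct cell *progeny[8]; /* NULL entries if the cell is not refined */
    };

    /* A gravity task attached to this cell is only worth splitting if the cell
     * has progeny to recurse into, holds enough particles for the split to pay
     * off, and is not already deeper than the user-chosen maximum depth. */
    static int gravity_task_can_split(const struct cell *c, int min_count,
                                      int max_split_depth) {
      if (c->progeny[0] == NULL) return 0;       /* nothing to recurse into */
      if (c->count <= min_count) return 0;       /* too small to be worth it */
      if (c->depth >= max_split_depth) return 0; /* user-imposed depth limit */
      return 1;
    }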

  • added 3 commits

    • eeed398d - More stream-lined version of engine_count_and_link()
    • 5d4530a7 - Link the MM task when creating them
    • 63bbb9be - Allocate the link tables before the call to scheduler_splittasks()

    Compare with previous version

  • added 2 commits

    • 8d949815 - Slight optimization in the linking of gravity tasks when MPI is switched off.
    • 004fd34d - Follow advice from the optimization report in scheduler_set_unlocks()

    Compare with previous version

  • added 1 commit

    • 733f2de2 - Do not recurse below the level where there are tasks when unskiping

    Compare with previous version
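
    The idea of not recursing below the task level during unskipping can be pictured as stopping the tree walk as soon as a cell owns gravity tasks: everything deeper is already covered by those tasks. A hedged sketch with invented names:

    #include <stddef.h>

    /* Illustrative cell: a flag recording whether gravity tasks hang off this
     * level, plus the usual progeny pointers. */
    struct cell {
      int has_grav_tasks;
      struct cell *progeny[8];
    };

    /* Stand-in for the real activation code. */
    static void activate_gravity_tasks(struct cell *c) { (void)c; }

    /* Recurse from the top of the tree, but stop at the first level that owns
     * gravity tasks: the cells below that level are covered by the tasks just
     * activated, so descending further is wasted work. */
    static void unskip_gravity_recursive(struct cell *c) {
      if (c == NULL) return;
      if (c->has_grav_tasks) {
        activate_gravity_tasks(c);
        return; /* do not recurse below the level where there are tasks */
      }
      for (int k = 0; k < 8; k++) unskip_gravity_recursive(c->progeny[k]);
    }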

  • added 2 commits

    • 363dcbd5 - Revert "Follow advice from the optimization report in scheduler_set_unlocks()"
    • 94a05e03 - Added checks to try to catch the random crashes in action.

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 88c04547 - Make the construction of self-gravity task an O(M^3) and not O(M^6) problem any more.

    Compare with previous version
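
    One plausible reading of the complexity remark, purely as an illustration: with M top-level cells per dimension, looping over every pair of cells and testing it is O(M^6), whereas visiting each cell once and pairing it only with a bounded neighbourhood, leaving anything more distant to long-range multipole interactions, is O(M^3). The grid indexing and function names below are invented for the sketch:

    /* Schematic construction of gravity tasks on an M x M x M grid of
     * top-level cells: only the loop structure matters here. */
    #define M 8 /* illustrative number of top-level cells per dimension */

    static void make_task(int ci, int cj) { (void)ci; (void)cj; /* placeholder */ }

    static int cell_index(int i, int j, int k) { return (i * M + j) * M + k; }

    /* O(M^3) construction: visit each cell once and only pair it with its
     * direct neighbours; anything further away is handled by long-range
     * multipole interactions. The naive alternative, looping over all pairs
     * of the M^3 cells and testing each one, is O(M^6). */
    static void construct_gravity_tasks(void) {
      for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
          for (int k = 0; k < M; k++) {
            const int ci = cell_index(i, j, k);
            make_task(ci, ci); /* self task */
            for (int di = -1; di <= 1; di++)
              for (int dj = -1; dj <= 1; dj++)
                for (int dk = -1; dk <= 1; dk++) {
                  const int ii = i + di, jj = j + dj, kk = k + dk;
                  if (di == 0 && dj == 0 && dk == 0) continue;
                  if (ii < 0 || ii >= M || jj < 0 || jj >= M ||
                      kk < 0 || kk >= M)
                    continue;
                  make_task(ci, cell_index(ii, jj, kk)); /* pair task */
                }
          }
    }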

  • added 1 commit

    • dc99b5bf - Use careful locks and unlocks around the activation of the gravity drifts.

    Compare with previous version

  • I have tried adding some locks around the offending task activation. I do not see why this would help, but it seems to do so (a sketch of the kind of locking meant is shown below).

    Could you please have a look to see whether there is anything else suspicious in this merge request?

    Thanks!
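
    For reference, the kind of locking referred to above can be sketched as taking a per-cell lock around the flip of the task's skip flag, so that two threadpool threads unskipping neighbouring cells cannot race on the same task. The pthread-based code below is an illustrative assumption, not the actual SWIFT locking primitives:

    #include <pthread.h>
    #include <stddef.h>

    /* Illustrative task and cell, with just enough state for the example. */
    struct task {
      int skip; /* 1 if the task is switched off for this step */
    };

    struct cell {
      struct task *grav_drift; /* gravity drift task attached to this cell */
      pthread_mutex_t lock;    /* protects activation of this cell's tasks */
    };

    /* Activate the gravity drift while holding the cell's lock, so that two
     * threadpool threads unskipping neighbouring cells cannot race on the
     * same task. (The lock is assumed to be initialised elsewhere.) */
    static void cell_activate_grav_drift(struct cell *c) {
      pthread_mutex_lock(&c->lock);
      if (c->grav_drift != NULL && c->grav_drift->skip) c->grav_drift->skip = 0;
      pthread_mutex_unlock(&c->lock);
    }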

  • I ran the EAGLE_50 test on a node of COSMA7 with the address sanitizer enabled and got this after 2000+ steps:

    ASAN:DEADLYSIGNAL
    =================================================================
    ==155620==ERROR: AddressSanitizer: SEGV on unknown address 0x2b53d748c96a (pc 0x0000004e01b8 bp 0x2b49b3acad30 sp 0x2b49b3acac50 T77)
    ==155620==The signal is caused by a WRITE memory access.
        #0 0x4e01b7 in scheduler_activate /cosma7/data/dp004/pdraper/swiftsim-check/src/scheduler.h:123
        #1 0x4e01b7 in cell_unskip_gravity_tasks /cosma7/data/dp004/pdraper/swiftsim-check/src/cell.c:2327
        #2 0x82f8be in runner_do_unskip_gravity /cosma7/data/dp004/pdraper/swiftsim-check/src/runner.c:981
        #3 0x82f8be in runner_do_unskip_mapper /cosma7/data/dp004/pdraper/swiftsim-check/src/runner.c:1008
        #4 0x5a1a6b in threadpool_chomp /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:155
        #5 0x5a1a6b in threadpool_runner /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:182
        #6 0x2b2a4e041e24 in start_thread (/lib64/libpthread.so.0+0x7e24)
        #7 0x2b2a4e34e34c in __clone (/lib64/libc.so.6+0xf834c)
    
    AddressSanitizer can not provide additional info.
    SUMMARY: AddressSanitizer: SEGV /cosma7/data/dp004/pdraper/swiftsim-check/src/scheduler.h:123 in scheduler_activate
    Thread T77 created by T0 here:
        #0 0x2b2a4a559060 in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cc:243
        #1 0x5a262a in threadpool_init /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:235
        #2 0x514bef in engine_config /cosma7/data/dp004/pdraper/swiftsim-check/src/engine.c:6105
        #3 0x4077d3 in main /cosma7/data/dp004/pdraper/swiftsim-check/examples/main.c:851
        #4 0x2b2a4e277c04 in __libc_start_main (/lib64/libc.so.6+0x21c04)
    
    ==155620==ABORTING
    

    I haven't looked closely, but maybe that makes more sense to you...

  • Excellent, thanks Peter. That's exactly the location where I used to catch it as well. The confusing bit is that this address would have been correct just the line above.

    Do you think that re-ordering the tasks within the cell structure could help identify what is going wrong, by possibly hitting a "more invalid" address?

  • One obvious problem with that task is that it isn't reset in space_rebuild_recycle_mapper, so it could just be residual from the previous rebuild. I'll make that fix and try again.

  • added 1 commit

    • 3e375e19 - Reset grav_mesh task when recycling cells

    Compare with previous version
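
    The hazard described above is a stale task pointer surviving a rebuild: if a recycled cell keeps pointing at a task from the previous task graph, the next unskip pass can follow that pointer into freed memory, which is consistent with the invalid write the sanitizer reported. A simplified sketch of the kind of reset meant, with illustrative names standing in for space_rebuild_recycle_mapper:

    #include <stddef.h>

    /* Illustrative cell holding pointers to the gravity tasks attached to it.
     * (struct task is left incomplete; only the pointers matter here.) */
    struct task;

    struct cell {
      struct task *grav_mesh;  /* mesh-gravity task from the previous rebuild */
      struct task *grav_drift; /* gravity drift task from the previous rebuild */
    };

    /* Simplified stand-in for the recycling step of a rebuild: every task
     * pointer must be cleared when a cell goes back into the pool. A pointer
     * that is not reset here can survive into the next rebuild and later be
     * activated while dangling. */
    static void recycle_cell(struct cell *c) {
      c->grav_mesh = NULL;
      c->grav_drift = NULL;
    }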

  • That's an interesting lead, thanks. The task I was crashing on was always of the gravity_drift kind, but it could be that this one was correct whilst the mesh one was incorrectly assigned and left dangling.

  • It takes a while to run; it's only up to step 1818, so no news yet...
