
Gravity splittask (non-MPI)

Merged Matthieu Schaller requested to merge gravity_splittask into master

This implements task splitting for gravity. Changes are:

  • Split large gravity tasks (self and pair)
  • Replace gravity pair tasks between cells that are far enough apart with a multipole-multipole (M-M) task.
  • Re-order operations in a time-step.

The last item fixes a long-standing accuracy issue: we were always running one more time-step after detecting that a rebuild was needed.

The task-splitting is currently not enabled over MPI. I will make this work alongside the periodic gravity over MPI in the next merge request.
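
To make the splitting/replacement decision concrete, here is a minimal sketch of the kind of per-pair choice involved: pairs of cells that are far enough apart become a single multipole-multipole (M-M) interaction, large nearby pairs are split by recursing into the cell hierarchy, and everything else stays a direct pair task. All struct fields, function names and the theta_crit / max_count_per_task parameters below are illustrative assumptions, not the actual SWIFT API.

    #include <stddef.h>

    /* Illustrative cell: only the fields needed for the decision are shown. */
    struct cell {
      double loc[3];           /* centre of the cell */
      double width;            /* cell edge length */
      int count;               /* number of gravity particles in the cell */
      struct cell *progeny[8]; /* NULL entries if the cell is not refined */
    };

    /* Opening-angle style test: two cells are "far enough" apart if their
     * combined size is small compared to the distance between them. */
    static int cells_are_far_enough(const struct cell *ci, const struct cell *cj,
                                    double theta_crit) {
      double r2 = 0.;
      for (int k = 0; k < 3; k++) {
        const double dx = ci->loc[k] - cj->loc[k];
        r2 += dx * dx;
      }
      const double size = ci->width + cj->width; /* crude bound on the extents */
      return size * size < theta_crit * theta_crit * r2;
    }

    /* What to do with a gravity pair task:
     *  - far apart   -> replace it with one multipole-multipole (M-M) task,
     *  - large+close -> split it by recursing into the progeny,
     *  - otherwise   -> keep the direct pair task. */
    enum pair_action { PAIR_MM, PAIR_SPLIT, PAIR_KEEP };

    static enum pair_action classify_pair(const struct cell *ci,
                                          const struct cell *cj,
                                          double theta_crit,
                                          int max_count_per_task) {
      if (cells_are_far_enough(ci, cj, theta_crit)) return PAIR_MM;
      if (ci->count > max_count_per_task && cj->count > max_count_per_task &&
          ci->progeny[0] != NULL && cj->progeny[0] != NULL)
        return PAIR_SPLIT;
      return PAIR_KEEP;
    }

Splitting only when both cells are populated and refined keeps the task graph from growing without bound, which is also what the depth limit introduced later in this merge request is for.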

Activity

  • :unamused: A long-running EAGLE-25 job crashed with what looks like a random seg-fault....

    So, back to me...

  • added 1 commit

    • 05c71340 - Better naming convention for the function deciding whether a task can be split in the hydro case.

    Compare with previous version

  • added 1 commit

    • 5ad98d9d - Added functions to decide whether a gravity task associated to a given cell can be split.

    Compare with previous version

  • added 1 commit

    • 0e8d069d - Limit the depth of the gravity tasks being split.

    Compare with previous version

  • added 1 commit

    • 5e33f262 - Do not construct hierarchical gravity tasks below the level where there are any tasks.

    Compare with previous version

  • added 1 commit

    • 42f89f51 - Allow the user to tweak the maximal depth of gravity tasks.

    Compare with previous version
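
    The commits above boil down to a depth-limited predicate for whether a gravity task attached to a cell may be split. A minimal sketch of such a predicate, assuming illustrative field and parameter names (min_count, max_split_depth) rather than the actual SWIFT ones:

    #include <stddef.h>

    /* Illustrative cell with only the fields needed for the decision. */
    struct cell {
      int count;               /* number of gravity particles in the cell */
      int depth;               /* depth in the tree (0 = top-level cell) */
      struct cell *progeny[8]; /* NULL entries if the cell is not refined */
    };

    /* A gravity task attached to this cell is only worth splitting if the cell
     * has progeny to recurse into, holds enough particles for the split to pay
     * off, and is not already deeper than the user-chosen maximum depth. */
    static int gravity_task_can_split(const struct cell *c, int min_count,
                                      int max_split_depth) {
      if (c->progeny[0] == NULL) return 0;       /* nothing to recurse into */
      if (c->count <= min_count) return 0;       /* too small to be worth it */
      if (c->depth >= max_split_depth) return 0; /* user-imposed depth limit */
      return 1;
    }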

  • added 3 commits

    • eeed398d - More stream-lined version of engine_count_and_link()
    • 5d4530a7 - Link the MM task when creating them
    • 63bbb9be - Allocate the link tables before the call to scheduler_splittasks()

    Compare with previous version

  • added 2 commits

    • 8d949815 - Slight optimization in the linking of gravity tasks when MPI is switched off.
    • 004fd34d - Follow advice from the optimization report in scheduler_set_unlocks()

    Compare with previous version

  • added 1 commit

    • 733f2de2 - Do not recurse below the level where there are tasks when unskiping

    Compare with previous version
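
    The idea of not recursing below the task level during unskipping can be pictured as stopping the tree walk as soon as a cell owns gravity tasks: everything deeper is already covered by those tasks. A hedged sketch with invented names:

    #include <stddef.h>

    /* Illustrative cell: a flag recording whether gravity tasks hang off this
     * level, plus the usual progeny pointers. */
    struct cell {
      int has_grav_tasks;
      struct cell *progeny[8];
    };

    /* Stand-in for the real activation code. */
    static void activate_gravity_tasks(struct cell *c) { (void)c; }

    /* Recurse from the top of the tree, but stop at the first level that owns
     * gravity tasks: the cells below that level are covered by the tasks just
     * activated, so descending further is wasted work. */
    static void unskip_gravity_recursive(struct cell *c) {
      if (c == NULL) return;
      if (c->has_grav_tasks) {
        activate_gravity_tasks(c);
        return; /* do not recurse below the level where there are tasks */
      }
      for (int k = 0; k < 8; k++) unskip_gravity_recursive(c->progeny[k]);
    }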

  • added 2 commits

    • 363dcbd5 - Revert "Follow advice from the optimization report in scheduler_set_unlocks()"
    • 94a05e03 - Added checks to try to catch the random crashes in action.

    Compare with previous version

  • added 1 commit

    Compare with previous version

  • added 1 commit

    • 88c04547 - Make the construction of self-gravity task an O(M^3) and not O(M^6) problem any more.

    Compare with previous version
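
    One plausible reading of the complexity remark, purely as an illustration: with M top-level cells per dimension, looping over every pair of cells and testing it is O(M^6), whereas visiting each cell once and pairing it only with a bounded neighbourhood, leaving anything more distant to long-range multipole interactions, is O(M^3). The grid indexing and function names below are invented for the sketch:

    /* Schematic construction of gravity tasks on an M x M x M grid of
     * top-level cells: only the loop structure matters here. */
    #define M 8 /* illustrative number of top-level cells per dimension */

    static void make_task(int ci, int cj) { (void)ci; (void)cj; /* placeholder */ }

    static int cell_index(int i, int j, int k) { return (i * M + j) * M + k; }

    /* O(M^3) construction: visit each cell once and only pair it with its
     * direct neighbours; anything further away is handled by long-range
     * multipole interactions. The naive alternative, looping over all pairs
     * of the M^3 cells and testing each one, is O(M^6). */
    static void construct_gravity_tasks(void) {
      for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
          for (int k = 0; k < M; k++) {
            const int ci = cell_index(i, j, k);
            make_task(ci, ci); /* self task */
            for (int di = -1; di <= 1; di++)
              for (int dj = -1; dj <= 1; dj++)
                for (int dk = -1; dk <= 1; dk++) {
                  const int ii = i + di, jj = j + dj, kk = k + dk;
                  if (di == 0 && dj == 0 && dk == 0) continue;
                  if (ii < 0 || ii >= M || jj < 0 || jj >= M ||
                      kk < 0 || kk >= M)
                    continue;
                  make_task(ci, cell_index(ii, jj, kk)); /* pair task */
                }
          }
    }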

  • added 1 commit

    • dc99b5bf - Use careful locks and unlocks around the activation of the gravity drifts.

    Compare with previous version

  • I have tried adding some locks around the offending task activation. I do not see why this would help, but it seems to do so (a sketch of the kind of locking meant is shown below).

    Could you please have a look to see whether there is anything else suspicious in this merge request?

    Thanks!
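
    For reference, the kind of locking referred to above can be sketched as taking a per-cell lock around the flip of the task's skip flag, so that two threadpool threads unskipping neighbouring cells cannot race on the same task. The pthread-based code below is an illustrative assumption, not the actual SWIFT locking primitives:

    #include <pthread.h>
    #include <stddef.h>

    /* Illustrative task and cell, with just enough state for the example. */
    struct task {
      int skip; /* 1 if the task is switched off for this step */
    };

    struct cell {
      struct task *grav_drift; /* gravity drift task attached to this cell */
      pthread_mutex_t lock;    /* protects activation of this cell's tasks */
    };

    /* Activate the gravity drift while holding the cell's lock, so that two
     * threadpool threads unskipping neighbouring cells cannot race on the
     * same task. (The lock is assumed to be initialised elsewhere.) */
    static void cell_activate_grav_drift(struct cell *c) {
      pthread_mutex_lock(&c->lock);
      if (c->grav_drift != NULL && c->grav_drift->skip) c->grav_drift->skip = 0;
      pthread_mutex_unlock(&c->lock);
    }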

  • I ran the EAGLE_50 test on a node of COSMA7 with the address sanitizer enabled and got this after 2000+ steps:

    ASAN:DEADLYSIGNAL
    =================================================================
    ==155620==ERROR: AddressSanitizer: SEGV on unknown address 0x2b53d748c96a (pc 0x0000004e01b8 bp 0x2b49b3acad30 sp 0x2b49b3acac50 T77)
    ==155620==The signal is caused by a WRITE memory access.
        #0 0x4e01b7 in scheduler_activate /cosma7/data/dp004/pdraper/swiftsim-check/src/scheduler.h:123
        #1 0x4e01b7 in cell_unskip_gravity_tasks /cosma7/data/dp004/pdraper/swiftsim-check/src/cell.c:2327
        #2 0x82f8be in runner_do_unskip_gravity /cosma7/data/dp004/pdraper/swiftsim-check/src/runner.c:981
        #3 0x82f8be in runner_do_unskip_mapper /cosma7/data/dp004/pdraper/swiftsim-check/src/runner.c:1008
        #4 0x5a1a6b in threadpool_chomp /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:155
        #5 0x5a1a6b in threadpool_runner /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:182
        #6 0x2b2a4e041e24 in start_thread (/lib64/libpthread.so.0+0x7e24)
        #7 0x2b2a4e34e34c in __clone (/lib64/libc.so.6+0xf834c)
    
    AddressSanitizer can not provide additional info.
    SUMMARY: AddressSanitizer: SEGV /cosma7/data/dp004/pdraper/swiftsim-check/src/scheduler.h:123 in scheduler_activate
    Thread T77 created by T0 here:
        #0 0x2b2a4a559060 in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cc:243
        #1 0x5a262a in threadpool_init /cosma7/data/dp004/pdraper/swiftsim-check/src/threadpool.c:235
        #2 0x514bef in engine_config /cosma7/data/dp004/pdraper/swiftsim-check/src/engine.c:6105
        #3 0x4077d3 in main /cosma7/data/dp004/pdraper/swiftsim-check/examples/main.c:851
        #4 0x2b2a4e277c04 in __libc_start_main (/lib64/libc.so.6+0x21c04)
    
    ==155620==ABORTING
    

    I haven't looked closely, but maybe that makes more sense to you...

  • Excellent, thanks Peter. That's exactly the location where I used to catch it as well. The confusing bit is that this address would have been correct just the line above.

    Do you think that re-ordering the tasks within the cell structure could help identify what is going wrong, by possibly hitting a "more invalid" address?

  • One obvious problem with that task is that it isn't reset in space_rebuild_recycle_mapper, so it could just be residual from the previous rebuild. I'll make that fix and try again.

  • added 1 commit

    • 3e375e19 - Reset grav_mesh task when recycling cells

    Compare with previous version
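
    The hazard described above is a stale task pointer surviving a rebuild: if a recycled cell keeps pointing at a task from the previous task graph, the next unskip pass can follow that pointer into freed memory, which is consistent with the invalid write the sanitizer reported. A simplified sketch of the kind of reset meant, with illustrative names standing in for space_rebuild_recycle_mapper:

    #include <stddef.h>

    /* Illustrative cell holding pointers to the gravity tasks attached to it.
     * (struct task is left incomplete; only the pointers matter here.) */
    struct task;

    struct cell {
      struct task *grav_mesh;  /* mesh-gravity task from the previous rebuild */
      struct task *grav_drift; /* gravity drift task from the previous rebuild */
    };

    /* Simplified stand-in for the recycling step of a rebuild: every task
     * pointer must be cleared when a cell goes back into the pool. A pointer
     * that is not reset here can survive into the next rebuild and later be
     * activated while dangling. */
    static void recycle_cell(struct cell *c) {
      c->grav_mesh = NULL;
      c->grav_drift = NULL;
    }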

  • That's an interesting lead, thanks. The task I was crashing on was always of the gravity_drift kind, but it could be that this one was correct whilst the mesh one was incorrectly assigned and left dangling.

  • It takes a while to run; it's only up to step 1818, so no news yet...
