Skip to content
Snippets Groups Projects

Split the drift into hydro and gravity drift tasks. Implement FFT task and dependencies

Merged Matthieu Schaller requested to merge gravity_multi_dt into master

Changes:

  • Two drift tasks, one pert particle type.
  • FFT task with gravity ghost tasks for the dependencies.

It passes my usual tests but could you make sure it does not break anything ? That should be the last big change. The rest of gravity should be contained within the tasks themselves.

Merge request reports

Merged by avatar (May 28, 2025 3:31am UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Added 1 commit:

    • 0a0a27c4 - Do not compile FFTW code if FFTW is not present.
  • Added 1 commit:

    • 28123d7d - Also add the new task types to the partition decision making.
  • Didn't get far... Just built using:

    ./configure --with-metis --enable-debug --enable-debugging-checks

    with GCC 5.4.0 and ran the standard SodShock test. That core dumped at:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x000000000042767f in scheduler_activate (t=0x0, s=0x7ffd2634e8f0) at scheduler.h:116
    116       if (atomic_cas(&t->skip, 1, 0)) {
    [Current thread is 1 (Thread 0x2acca6e9b5c0 (LWP 3743))]
    (gdb) bt
    #0  0x000000000042767f in scheduler_activate (t=0x0, s=0x7ffd2634e8f0) at scheduler.h:116
    #1  engine_unskip (e=0x7ffd2634e8d0) at engine.c:3483
    #2  engine_prepare (e=e@entry=0x7ffd2634e8d0) at engine.c:2831
    #3  0x000000000042ae55 in engine_step (e=e@entry=0x7ffd2634e8d0) at engine.c:3349
    #4  0x0000000000403bec in main (argc=<optimised out>, argv=<optimised out>) at main.c:628

    Seems that grav_top_level is NULL at:

    #1  engine_unskip (e=0x7ffd2634e8d0) at engine.c:3483
    3483      if (e->s->periodic) scheduler_activate(&e->sched, e->s->grav_top_level);
  • Added 1 commit:

    • d92cb31d - Do not activate the FFT task in situations where it should not exist
  • Thanks. That should now be fixed.

  • test125cells causes a floating point exception when running:

    ./test125cells -n 6 -r 100 -v 1 -p 2
  • It crashes on line 168 of equation_of_state.h

  • What configuration ?

  • ./configure --disable-vec, but I'm sure it crashes with a normal ./configure as well. With the Intel compiler.

  • Looks like there is another issue. Running the EAGLE_12 test case with debugging checks enabled I eventually see :

    drift.h:drift_part():86: particle has not been drifted to the current time p->ti_drift=2597046464806912, c->ti_old=2594847441551360, ti_current=2597046464806912

    the ticks vary between the reports as does the exact step (usually need at least a thousand steps, but can need a few thousand). Seen with GCC 5.4.0 and 4.8.5, with and without vectorisation.

    Tried repeating it in master, but that didn't turn up the issue.

  • Alright... :cry:

  • Matthieu Schaller Added 7 commits:

    Added 7 commits:

    • d92cb31d...33476206 - 4 commits from branch master
    • 8d8be86e - Correctly initialise the engine in the FFT test.
    • e56eaf3d - Merge branch 'master' into gravity_multi_dt
    • 7d68821a - Reset the particles for every run in test125.
  • @jwillis that should fix the bug you reported. You can cherry-pick commit 7d68821a into your branch if that's blocking you.

  • Okay thanks, I will try it out

  • Here's a back trace from another occurrence of that drift bug:

    #0  0x00002b6d77677428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
    #1  0x00002b6d7767902a in __GI_abort () at abort.c:89
    #2  0x000000000041b96e in drift_part (ti_current=1035739953364992, ti_old=1033540930109440, timeBase=6.9388939039072285e-20, dt=1.52587887e-07, 
        xp=0x2b6dcc42db00, p=0x2b6db2287ac0) at drift.h:83
    #3  cell_drift_part (c=0x2b6df5177e00, e=0x7ffd71592fe0) at cell.c:1567
    #4  0x000000000041b737 in cell_drift_part (c=0x2b6ddc1d0e80, e=0x7ffd71592fe0) at cell.c:1548
    #5  0x0000000000465f82 in runner_dosub_subset_density (r=0xc3fce0, ci=0x2b6de8a9c100, parts=0x2b6db1358ae0, ind=0x2b6f83774d40, count=6, cj=0x2b6ddc1d0e80, 
        sid=-1, gettimer=0) at runner_doiact.h:3221
    #6  0x0000000000461545 in runner_dosub_subset_density (r=0xc3fce0, ci=0x2b6dd5c5a700, parts=0x2b6db1358ae0, ind=0x2b6f83774d40, count=6, cj=0x2b6df9690080, 
        sid=1, gettimer=1) at runner_doiact.h:2712
    #7  0x000000000048d1f3 in runner_do_ghost (r=0xc3fce0, c=0x2b6ded363280, timer=0) at runner.c:768
    #8  0x000000000048c4a5 in runner_do_ghost (r=0xc3fce0, c=0x2b6df0ea6e00, timer=0) at runner.c:640
    #9  0x000000000048c4a5 in runner_do_ghost (r=0xc3fce0, c=0x2b6de8a9c100, timer=0) at runner.c:640
    #10 0x000000000048c4a5 in runner_do_ghost (r=0xc3fce0, c=0x2b6dd5c5a700, timer=0) at runner.c:640
    #11 0x000000000048c4a5 in runner_do_ghost (r=0xc3fce0, c=0x2b6e049ff780, timer=0) at runner.c:640
    #12 0x000000000048c4a5 in runner_do_ghost (r=0xc3fce0, c=0x2b6dfcaef300, timer=1) at runner.c:640
    #13 0x00000000004933b7 in runner_main (data=0xc3fce0) at runner.c:1879
    #14 0x00002b6d7667d6ba in start_thread (arg=0x2b6dd0ccf700) at pthread_create.c:333
    #15 0x00002b6d7774882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
  • and a second back trace (different OS to above), which is quite different:

    #0  0x00002ae40e6a41d7 in raise () from /lib64/libc.so.6
    #1  0x00002ae40e6a58c8 in abort () from /lib64/libc.so.6
    #2  0x0000000000420242 in drift_part (timeBase=<optimized out>, ti_current=<optimized out>, ti_old=1974722883485696, dt=<optimized out>, xp=<optimized out>, p=<optimized out>)
        at drift.h:83
    #3  cell_drift_part (c=c@entry=0x2ae4a9297680, e=e@entry=0x7fff14f80e20) at cell.c:1567
    #4  0x000000000041fc0c in cell_drift_part (c=0x2ae4893ab880, e=e@entry=0x7fff14f80e20) at cell.c:1548
    #5  0x0000000000463227 in runner_dosub_pair1_density (r=0x22f2b30, ci=0x2ae474aa1000, cj=0x2ae4893ab880, sid=12, gettimer=<optimized out>) at runner_doiact.h:2283
    #6  0x0000000000463ce7 in runner_dosub_pair1_density (r=0x22f2b30, ci=0x2ae494681080, cj=0x2ae478457f00, sid=12, gettimer=<optimized out>) at runner_doiact.h:2242
    #7  0x0000000000463ce7 in runner_dosub_pair1_density (r=0x22f2b30, ci=0x2ae4843f9480, cj=0x2ae47496c800, sid=12, gettimer=<optimized out>) at runner_doiact.h:2242
    #8  0x0000000000476754 in runner_main (data=0x22f2b30) at runner.c:1859
    #9  0x00002ae40e45adc5 in start_thread () from /lib64/libpthread.so.0
    #10 0x00002ae40e76673d in clone () from /lib64/libc.so.6
  • 223 227 else {
    228
    224 229 lock_lock(&ci->lock);
    225 if (ci->drift == NULL)
    226 ci->drift = scheduler_addtask(s, task_type_drift, task_subtype_none,
    227 0, 0, ci, NULL);
    230
    231 /* Drift gparts case */
    232 if (t->subtype == task_subtype_grav && ci->drift_gpart == NULL)
    233 ci->drift_gpart = scheduler_addtask(
    234 s, task_type_drift_gpart, task_subtype_none, 0, 0, ci, NULL);
    235
    236 /* Drift parts case */
    237 else if (t->subtype == task_subtype_density && ci->drift_part == NULL)
    238 ci->drift_part = scheduler_addtask(s, task_type_drift_part,
    239 task_subtype_none, 0, 0, ci, NULL);
    • The logic here is also changed. We seem to require two subtypes before scheduling a drift. Are you sure the else if shouldn't just be a check for NULL.

  • 1186 1208 }
    1187 1209 #endif
    1188 1210
    1211 scheduler_print_tasks(s, "tasks.dat");
    1212
  • Matthieu Schaller Added 4 commits:

    Added 4 commits:

    • 271f400c - Unwanted include
    • 2ff622ff - Move the sid_scale definition to sort_part.h
    • 43ec3bd8 - Do not print the task list every time-step
    • f99ac831 - Separate the task splitting procedure between the hydro and the gravity tasks
  • Unable to load the diff
  • Sorry, I pushed things but only wanted to commit at this stage. I am working on splitting the splitting of tasks between a hydro and a gravity version. That should prevent logical errors.

  • Unable to load the diff
    • And the second check of that also reproduces the bug:

      [03734.3] drift.h:drift_part():86: particle has not been drifted to the current time p->ti_drift=28516933577998336, c->ti_old=28508137484976128, ti_current=28516933577998336

      I'll open an issue for that.

  • mentioned in issue #308 (closed)

  • Added 1 commit:

    • d116018c - Implemented a first version of the splittask routine for gravity tasks.
  • I think I now have a version where the logic is correct or at least not more wrong than in master.

  • Peter W. Draper Added 1 commit:

    Added 1 commit:

    • d116018c - Implemented a first version of the splittask routine for gravity tasks.
  • Thanks, in the first test this has ran to completion, so I'll try that again.

  • And that failed after 2173 steps.

  • This second crash is the same issue as in master, right ? Or am I missing something here ?

  • Yes, same issue. The only thing that worries me about this branch is that the error seems to occur more frequently than master, otherwise I'd accept and we could move on. I'll try some more tests.

  • Peter W. Draper Added 1 commit:

    Added 1 commit:

    • f866d555 - Use task_subtype_none as subtype is expected
  • mentioned in issue #303 (closed)

  • Which also failed at various times with the same error. A larger MPI test reported:

    [0001] [00642.6] drift.h:drift_part():86: particle has not been drifted to the current time p->ti_drift=5655887813279744, c->ti_old=5651489766768640, ti_current=5655887813279744
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
    [0001] [00660.6] runner_doiact_vec.c:runner_dopair1_density_vec():668: particle shift diff exceeds dx_max_sort.
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
    [0000] [02753.9] runner_doiact_vec.c:runner_dopair1_density_vec():668: particle shift diff exceeds dx_max_sort.
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
    [0002] [01500.2] drift.h:drift_part():86: particle has not been drifted to the current time p->ti_drift=22940210601918464, c->ti_old=22799473113563136, ti_current=22940210601918464
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
    [0002] [00154.2] drift.h:drift_part():86: particle has not been drifted to the current time p->ti_drift=870813209198592, c->ti_old=868614185943040, ti_current=870813209198592
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2

    which you'd guess is the same problem in a new guise. I'll accept this merge request for now as I don't think these are caused by this patch.

  • Peter W. Draper Status changed to merged

    Status changed to merged

  • Peter W. Draper mentioned in commit e7c76035

    mentioned in commit e7c76035

  • mentioned in issue #26 (closed)

  • Please register or sign in to reply
    Loading