Skip to content
Snippets Groups Projects

Draft: Sink MPI: Let's add the physics

Open Darwin requested to merge darwin/sink_mpi_physics into master
4 unresolved threads

In this MR, I implemented the MPI sink tasks.

@matthieu @jdavies, here is the MR for the sink MPI. I'm starting from scratch, but here we can interact about the MPI, and you can follow my progress.

TODO

  • Implement sink formation
  • Implement sink self/pair tasks (density, swallow, do_gas_swallow and do_sink_swallow)
  • Separate SF sinks from SF
  • Make everything work
  • Debug
  • Update the doc
  • Take care of gpart data access in the Basic model
Edited by Darwin

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Darwin added MPI sink labels

    added MPI sink labels

  • assigned to @matthieu

  • Darwin added 3 commits

    added 3 commits

    • 9ffd8fbb - Add sink tasks in engine.c
    • d477012d - Implement do_sink/gas_swallow MPI in runner_sinks.h
    • bc5df723 - Add a parameter in GEAR sink to disable SF

    Compare with previous version

  • Darwin added 16 commits

    added 16 commits

    • bc5df723...edfe2e62 - 6 earlier commits
    • 21e31db4 - Update engine_addtasks_recv_mapper with the sinks
    • 7307580f - Fix misnamed tasks
    • caa1f293 - Mark allocated foreign sinks as not swallowed by sinks when allocating space
    • aa9ede01 - Implement runner_recv_sinks
    • f948c7b8 - Implement runner_recv_sinks
    • f4c963fa - Add sinks to scheduler.c
    • 0bfc4e72 - Add sink mpi tasks in runner_main.c
    • cf99f3a0 - Disable sink formation and sf_sink for now
    • 333ac63f - Add sf_sinks gravity counts dependency with the standard sf t_grav_counts
    • e2cc9088 - Change HomogeneousBoxSinkParticles for debugging

    Compare with previous version

  • Darwin added 7 commits

    added 7 commits

    • 128232d3 - Change HomogeneousBoxSinkParticles for debugging part 2
    • 7677f6d5 - Remove sink MPI TODO errors in cell_unskip.c
    • 440e7289 - Improve cell_unskip_sinks_tasks()
    • 5f002ef9 - Add todos comment where we need to implement sf_sinks tasks in cell_unskip.c
    • 3cac42dd - Activate sink ghosts tasks on local and foreign nodes
    • 0177ac46 - Unskip sink MPI tasks similarly to BH (without sink/star formation)
    • 307942a8 - Update the estimate number of sink tasks to include the MPI tasks

    Compare with previous version

  • Author Developer

    I pushed a first draft of the MPI implementation.

    • I disabled sink and star formation for now. It's easier to debug with a fixed number of sinks.
    • However, I started implementing the MPI for sink/star formation. It's not supposed to be used in the current state.
    • The density, swallow, do_gas_swallow, and do_sink_swallow are implemented.

    Now, I can start testing and debugging :)

  • Author Developer

    The current version seems to work without gravity. With gravity, I have to take care of the sinks' calls to the gparts' softening length, which is not MPI-friendly when the sink is foreign and thus crashes.

  • Darwin added 5 commits

    added 5 commits

    • a820b637 - Update plot_task_dependencies.py to properly color the sink MPI rho task
    • c1ffef62 - Fix typo
    • 6d5c39a0 - Increase links_per_tasks in HomogeneousBoxSinkParticles example
    • f687f7d3 - Remove test line
    • e85f441a - Only do swallow interaction on local node

    Compare with previous version

  • Darwin marked the checklist item Implement sink self/pair tasks (density, swallow, do_gas_swallow and do_sink_swallow) as completed

    marked the checklist item Implement sink self/pair tasks (density, swallow, do_gas_swallow and do_sink_swallow) as completed

  • Author Developer

    @matthieu I'm thinking about star formation. Since we can have SF and SF_sink enabled simultaneously, how should we handle the task_subtype_sf_counts and task_subtype_grav_counts?

    I see three possibilities:

    1. Have separate tasks for sf_counts and sf_sink_counts, and for grav_counts and sf_sink_grav_counts.
    2. Have separate tasks for sf_counts and sf_sink_counts, but keep the same task for grav_counts. We would have grav_counts depending on both sf_counts and sf_sink_counts.
    3. Have a single task for the two SF methods.

    Which one do you think is best?

  • Darwin added 5 commits

    added 5 commits

    • e85f441a...4ea5850e - 2 commits from branch master
    • 87e1d4bc - Implement send and recv gravity tasks for sink_formation and sf_sink
    • 895173a5 - Fix star_formation --> star_formation_sink when unlocking sf_sink tasks
    • ff39a394 - Merge with master + fix conflict

    Compare with previous version

  • Darwin added 9 commits

    added 9 commits

    • 13adcaf0 - Unskip sf_sink MPI tasks
    • df3572e5 - Change with_star_formation to with_star_formation_sink
    • aeb7a9f6 - Unskip sf_gav_counts for sf_sink
    • c90755e4 - Only print parameter reading for rank 0
    • 4bb329cc - Remove error message when running with sinks+MPI in send_stars tasks
    • 5d64a711 - Add missing dependencies between recv_rho/recv_xv (hydro) and sink_density loop
    • ba58a6c9 - Comment sink warning message when there are no neighbours
    • 2314cc14 - Replace gravity_get_softening() by sink_get_softening()
    • 4177a7aa - Remove the local swallow interactions

    Compare with previous version

  • Darwin added 5 commits

    added 5 commits

    • f8144964 - Update with_star_formation --> with_star_formation_sink
    • 75987519 - Verify hydro.counts !=0 in runner_doiact_fcuntion_sinks.h
    • bc86923e - Add sink_ghost1 --> do_gas_swallow dependency on foreign node
    • 7c08838d - Allow star formation sink
    • 981468b1 - Print do sink/gas swallow messages in debugging check mode

    Compare with previous version

  • Darwin marked the checklist item Implement sink formation as completed

    marked the checklist item Implement sink formation as completed

  • Darwin added 7 commits

    added 7 commits

    • 981468b1...9814b234 - 2 commits from branch master
    • 2f0c061c - Add mpi_run.sh file in HomogeneousBox example
    • 981beba8 - Unskip sink formation MPI tasks (sink + gravity)
    • 7128eda9 - Allow sink formation in runner_others.c
    • cca60af4 - Comment temporarily proxies messages
    • 3e89d969 - Merge branch 'master' into darwin/sink_mpi_physics

    Compare with previous version

  • Darwin added 6 commits

    added 6 commits

    • 09de97b9 - Remove t_sink_formation_grav_count task
    • 33b268fa - Remove cell_un/pack_sink_formation_grav_counts() fcts and struct
    • ebaa951e - Rework the gravity send MPI dependencies between sink_formation, SF, SF_sink and grav_counts tasks
    • 9c5d9b54 - Rework the gravity recv MPI dependencies for grav_counts tasks
    • 685add96 - Add a TODO in cell_pack.c to rename the "pcell_sf_grav" struct
    • b0f26404 - Rework the gravity MPI cell_unskip tasks activation for SF, SF_sink and sink_formation

    Compare with previous version

  • Author Developer

    Here is an update.

    I changed how I implemented the new sink and star gravity counts tasks. Now, we have only one send and recv gravity count and the SF, SF_sink and sink_formation dependencies.

    I changed the logic because I encountered crashes with the old way (in the HomogeneousBox level 5). After thinking a bit, I understood that the send/recv_grav_counts are agnostic of whether the new gparts come from sink or star formation. Thus, there was no reason to complicate the code and have functions that do the same thing but under different names and structs (with the same content).

    The new implementation seems to work better. I have not encountered problems with the HomogeneousBox at level 5. I will move to level 8 and test on a cluster with more MPI ranks.

  • Darwin added 11 commits

    added 11 commits

    Compare with previous version

    • Author Developer

      Update: I'm trying to run the level 8 (~ 16 million particles) HomogeneousBox on our cluster. I encountered the following problem with parameters at the beginning of the simulation.

      Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
      Input Error: Incorrect tpwgts for partition 3 and constraint 0.
      Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
      Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
      [bonsai02:922573:0:922573] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xec623bf4)
      [bonsai02:922574:0:922574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xbab3ebc8)
      ==== backtrace (tid: 922574) ====
       0 0x0000000000054db0 __GI___sigaction()  :0
       1 0x0000000000045d1d libparmetis__PartitionSmallGraph()  ???:0
       2 0x000000000002390e ParMETIS_V3_PartKway()  ???:0
       3 0x00000000004e242d pick_parmetis()  /home/astro/roduit/program/swiftsim_2_debug/src/partition.c:1060
       4 0x00000000004eac24 partition_initial_partition()  /home/astro/roduit/program/swiftsim_2_debug/src/partition.c:2001
       5 0x00000000004537aa engine_split()  /home/astro/roduit/program/swiftsim_2_debug/src/engine.c:3104
       6 0x000000000040bc58 main()  /home/astro/roduit/program/swiftsim_2_debug/swift.c:1622
       7 0x000000000003feb0 __libc_start_call_main()  ???:0
       8 0x000000000003ff60 __libc_start_main_alias_2()  :0
       9 0x000000000040e645 _start()  ???:0
      =================================

      Also note that in src/partition.c, we need to add t->type == task_type_star_formation_sink || for the tasks weights in check_weights() and partition_gather_weights().

    • If that error persists, I recommend to post some details in the mpi slack channel to get some hints to progress.

    • Author Developer

      Thanks for the suggestion. I forgot this channel!

      For now, I got it working using only Metis.

    • Please register or sign in to reply
  • Darwin added 7 commits

    added 7 commits

    Compare with previous version

  • Author Developer

    Update + notes for me: I got the Homogeneous box example running on our cluster.

    Particle shift diff

    I encountered:

    [0002] [05966.4] runner_doiact_functions_stars.h:runner_dopair_branch_stars_feedback():1293: particle shift diff exceeds dx_max_sort in cell cj. cj->nodeID=7 ci->nodeID=2 d=6.515264e+00 sort_j[pjd].d=0.000000e+00 cj->stars.dx_max_sort=0.000000e+00 cj->stars.dx_max_sort_old=0.000000e+00, cellID=162922513 super->cellID=3538961cj->depth=4 cj->maxdepth=7

    The ones for hydor+limiter/synchronisation are commented, given they often trigger with sinks. I need to comment this check, too.

    Gravity

    More related to sinks, I got after 1h30-2h of run:

    [0006] [06705.9] runner_recv.c:runner_do_recv_gpart():198: Received a cell at an incorrect time c->ti_end_min=0, e->ti_current=4325478743670784.

    and

    [0000] [05899.3] runner_doiact_grav.c:runner_dopair_grav_pp():1234: Un-drifted gparts

    I tried:

    • with GEAR SF and this branch: same errors
    • with GEAR SF and master (yesterday's morning version): no error

    Next steps

    • Print the cell IDs to track the cell's task dependencies
    • Compare this branch with master to find out the problem
    • Improve check in recv for sf_sink (I left it as a TODO...)
  • Darwin changed the description

    changed the description

  • Darwin added 15 commits

    added 15 commits

    Compare with previous version

  • Darwin added 2 commits

    added 2 commits

    • 1f077a2f - Update runner_do_recv_spart() to add with_sf_sink for the debugging check
    • f37651e3 - Merge branch 'darwin/sink_mpi_physics' of...

    Compare with previous version

  • Darwin added 1 commit

    added 1 commit

    • 8148a2ca - Convert r_cut to h in proxy.c and runner_recv.c

    Compare with previous version

  • Darwin added 5 commits

    added 5 commits

    • 2f4bb6e2 - Remove unused comment
    • 2075c302 - Amend comments in the use of pcell struct involving stars and sinks
    • 0b72b6a8 - Remove extra empty case for task_subtype_limiter
    • 9de19bb9 - Improve comments about adding sf_sinks_counts in the future
    • cd840aee - Amend comments for clarity in scheduler.c

    Compare with previous version

  • Darwin added 6 commits

    added 6 commits

    Compare with previous version

  • Darwin added 4 commits

    added 4 commits

    • 6cd3ea97 - Improve t_grav_counts task creation and dependencies for sinks/stars
    • d29b76fa - Link sf, sf_sinks and sink_formation only once for the three tasks
    • 51f97a02 - Use the top cell's count to add the dependency t_grav --> X_formation
    • 22f98136 - Cleaning

    Compare with previous version

  • Darwin added 19 commits

    added 19 commits

    • 22f98136...b585e17e - 16 commits from branch master
    • 5d6a6b86 - Merge branch 'master' into darwin/sink_mpi_physics
    • 1f7c7264 - Rework cell_unskip.c to unskip sink tasks on foreign nodes to not miss any dependencies
    • 3ea99e69 - Add sink_ghost2 dependcy on the foreign cell to not miss any dependencies

    Compare with previous version

  • Author Developer

    Hi everyone, I'm sorry for the late update. I was busy with my PhD, but this gave me time to reflect on the causes of various problems.

    I turned off SF to ensure that the other dependencies were okay. It turned out that we still have problems with "undrifted part" and also with "sink was not swallowed." I inspected the cell dependencies, and they are sometimes broken. For instance, do_gas_sink_swallow is not linked to sink_swallow. This often happens because do_gas_swallow or sink_ghost2 were not activated. This convinced me that the task dependencies were fine in engine_maketasks.c. This is probably a task activation problem.

    Since the task activation was mainly based on BHs and we do not create BHs on the fly, I had a look at the stars. (Remember that sinks are a mix of BHs and stars) Currently, I’m checking task activation for foreign cells to ensure they do not have broken dependencies because some tasks were not activated.

    • Great. Yes, the on-the-fly creation makes things more complicated.

      Could it help to run a scenario where there are sinks in the ICs but none are created during the run?

    • Author Developer

      I will retry this example. It worked months ago but given the changes I made, it's worth trying again.

      Thanks for the suggestion!

    • Please register or sign in to reply
  • Darwin added 1 commit

    added 1 commit

    • 6c5a24f1 - Add dependencies send_sink_rho --> send_sink_gas_swallow --> send_sink_merger...

    Compare with previous version

  • Author Developer

    I've just added dependencies between the sink send tasks to send things in an ordered manner when the interaction tasks are not created. This is similar to the gas send tasks.

    These changes + the previous ones in cell_unskip.c made the code run for longer, ~ 12 hours instead of 3. That's an improvement!

    Notice that I disabled SF for now since it does not even work for SF runs (i.e. with --star-formation)

    Edited by Darwin
    • Author Developer

      There is also this weird behaviour of star_formation_sink that can stars before we actually do sink_formation in the top level cells.

      dependency_graph_cell_18_step_0_rank_3

    • Author Developer

      Mea culpa: the previous graph is fine. Here is the non-mpi equivalent for the same cell.

      dependency_graph_cell_18_step_2365_rank_0

    • Please register or sign in to reply
  • Author Developer

    A few notes for me later:

    • I need to print multiple cell dependencies at the same time so that I can have the top cells, the grav and hydro super at the same time.
    • Can I write the dependencies when an error is triggered? Probably yes, by modifying scheduler_write_cell_dependencies() to take a cellID and call it before throwing an error. This would greatly help in finding the dependencies/task activation errors. [16.04.2025: Answer: Yes this can be done and works extremely well!]

    Here are three (working) dependency graphs for cell 4063249 (hydro super) for 0 and 3 step 592, rank 0 step 594.

    dependency_graph_cell_4063249_step_592_rank_0

    dependency_graph_cell_4063249_step_592_rank_3

    dependency_graph_cell_4063249_step_594_rank_0

    Edited by Darwin
  • Darwin added 23 commits

    added 23 commits

    Compare with previous version

  • Author Developer

    This is an update with good news. Thanks to point 2 in my previous post, I found the dependency/task activation error that generated my "undrafted sink" in sink_swallow.

    I added a scheduler_write_cell_dependencies_debug() (to be committed) that writes the dependencies for the current cell, the hydro super, the grav super, and the top cells. Then, I call it before throwing an error. This is much more efficient than choosing a cell beforehand and praying to the programming gods that this cell will have problems.

    The following dependency graph shows this cell is missing a sink_density task. I don't know if this cell is local or foreign, but we are missing sink_in, sink_density, sink_density_ghost, and sink_density_send/recv. Notice that the stars have the stars_in and density tasks. dependency_graph_cell_2097157_step_787_rank_2

    The next step is to have a look at the sink density task activation. This one is responsible for the activation of the missing tasks I mentioned.

  • Darwin added 3 commits

    added 3 commits

    • 8fafbe1f - Improve scheduler_write_cell_dependencies() to pass the cellID and add...
    • ed25d7be - Propagate the new scheduler_write_cell_dependencies() fct definition to engine.c
    • 9b424893 - Debugging tasks dependencies in swallow_sink (sink swallow sink)

    Compare with previous version

  • Good detective work!

  • Author Developer

    Update: the density sub-pair tasks are not activated because they are NULL

    --> Are they created in engine_maketasks ?

  • Author Developer

    Hey everyone! Here is an update.

    I finally managed to restart with MPI. (I had a mismatch for some memory data structures for unknown reasons...). Now I have a ~ 5 minutes waiting time to debug! Debugging will be much faster.

    For my previous post, I read the wrong line; the density tasks are created. However, they are flagged as skipped... :thinking:

    But, good news! The bug is reproducible and happens at the same timestep.

    However, the cells can be different. At this timestep, I believe that a set of cells have their density tasks flagged as skipped, and at each restart, I find a different subset of these cells.

    Next steps: understand why/how/where these tasks are flagged as skipped.

    As a note, I will be on holiday (without laptop for once ;) from the 8th of May to the 17th. With a clear mind, debugging will be better!

Please register or sign in to reply
Loading