Draft: Sink MPI: Let's add the physics
In this MR, I implemented the MPI sink tasks.
@matthieu @jdavies, here is the MR for the sink MPI. I'm starting from scratch, but we can discuss the MPI implementation here, and you can follow my progress.
TODO
- Implement sink formation
- Implement sink self/pair tasks (density, swallow, do_gas_swallow and do_sink_swallow)
- Separate SF sinks from SF
- Make everything work
- Debug
- Update the doc
- Take care of gpart data access in the Basic model
assigned to @matthieu
added 16 commits
- bc5df723...edfe2e62 - 6 earlier commits
- 21e31db4 - Update engine_addtasks_recv_mapper with the sinks
- 7307580f - Fix misnamed tasks
- caa1f293 - Mark allocated foreign sinks as not swallowed by sinks when allocating space
- aa9ede01 - Implement runner_recv_sinks
- f948c7b8 - Implement runner_recv_sinks
- f4c963fa - Add sinks to scheduler.c
- 0bfc4e72 - Add sink mpi tasks in runner_main.c
- cf99f3a0 - Disable sink formation and sf_sink for now
- 333ac63f - Add sf_sinks gravity counts dependency with the standard sf t_grav_counts
- e2cc9088 - Change HomogeneousBoxSinkParticles for debugging
added 7 commits
- 128232d3 - Change HomogeneousBoxSinkParticles for debugging part 2
- 7677f6d5 - Remove sink MPI TODO errors in cell_unskip.c
- 440e7289 - Improve cell_unskip_sinks_tasks()
- 5f002ef9 - Add todos comment where we need to implement sf_sinks tasks in cell_unskip.c
- 3cac42dd - Activate sink ghosts tasks on local and foreign nodes
- 0177ac46 - Unskip sink MPI tasks similarly to BH (without sink/star formation)
- 307942a8 - Update the estimate number of sink tasks to include the MPI tasks
I pushed a first draft of the MPI implementation.
- I disabled sink and star formation for now. It's easier to debug with a fixed number of sinks.
- However, I started implementing the MPI for sink/star formation. It's not supposed to be used in the current state.
- The density, swallow, do_gas_swallow, and do_sink_swallow tasks are implemented.
Now, I can start testing and debugging :)
@matthieu I'm thinking about star formation. Since we can have SF and SF_sink enabled simultaneously, how should we handle the task_subtype_sf_counts and task_subtype_grav_counts? I see three possibilities:
- Have separate tasks for sf_counts and sf_sink_counts, and for grav_counts and sf_sink_grav_counts.
- Have separate tasks for sf_counts and sf_sink_counts, but keep the same task for grav_counts. We would have grav_counts depending on both sf_counts and sf_sink_counts.
- Have a single task for the two SF methods.

Which one do you think is best?
added 5 commits
- e85f441a...4ea5850e - 2 commits from branch master
- 87e1d4bc - Implement send and recv gravity tasks for sink_formation and sf_sink
- 895173a5 - Fix star_formation --> star_formation_sink when unlocking sf_sink tasks
- ff39a394 - Merge with master + fix conflict
added 9 commits
- 13adcaf0 - Unskip sf_sink MPI tasks
- df3572e5 - Change with_star_formation to with_star_formation_sink
- aeb7a9f6 - Unskip sf_gav_counts for sf_sink
- c90755e4 - Only print parameter reading for rank 0
- 4bb329cc - Remove error message when running with sinks+MPI in send_stars tasks
- 5d64a711 - Add missing dependencies between recv_rho/recv_xv (hydro) and sink_density loop
- ba58a6c9 - Comment sink warning message when there are no neighbours
- 2314cc14 - Replace gravity_get_softening() by sink_get_softening()
- 4177a7aa - Remove the local swallow interactions
added 5 commits
- f8144964 - Update with_star_formation --> with_star_formation_sink
- 75987519 - Verify hydro.counts !=0 in runner_doiact_fcuntion_sinks.h
- bc86923e - Add sink_ghost1 --> do_gas_swallow dependency on foreign node
- 7c08838d - Allow star formation sink
- 981468b1 - Print do sink/gas swallow messages in debugging check mode
added 7 commits
- 981468b1...9814b234 - 2 commits from branch master
- 2f0c061c - Add mpi_run.sh file in HomogeneousBox example
- 981beba8 - Unskip sink formation MPI tasks (sink + gravity)
- 7128eda9 - Allow sink formation in runner_others.c
- cca60af4 - Comment temporarily proxies messages
- 3e89d969 - Merge branch 'master' into darwin/sink_mpi_physics
added 6 commits
- 09de97b9 - Remove t_sink_formation_grav_count task
- 33b268fa - Remove cell_un/pack_sink_formation_grav_counts() fcts and struct
- ebaa951e - Rework the gravity send MPI dependencies between sink_formation, SF, SF_sink and grav_counts tasks
- 9c5d9b54 - Rework the gravity recv MPI dependencies for grav_counts tasks
- 685add96 - Add a TODO in cell_pack.c to rename the "pcell_sf_grav" struct
- b0f26404 - Rework the gravity MPI cell_unskip tasks activation for SF, SF_sink and sink_formation
Here is an update.
I changed how I implemented the new sink and star gravity counts tasks. Now, we have only one send and one recv gravity-counts task, with the SF, SF_sink and sink_formation dependencies.

I changed the logic because I encountered crashes with the old approach (in the HomogeneousBox at level 5). After thinking about it, I understood that the send/recv_grav_counts tasks are agnostic of whether the new gparts come from sink or star formation. Thus, there was no reason to complicate the code with functions that do the same thing under different names and structs (with identical content).

The new implementation seems to work better: I have not encountered problems with the HomogeneousBox at level 5. I will move to level 8 and test on a cluster with more MPI ranks.
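For reference, here is a minimal sketch of the new wiring, assuming SWIFT's scheduler_addunlock(s, unlocker, unlocked) helper; the task pointer names are illustrative, not the actual variables in engine_maketasks.c:

```c
/* Sketch only (illustrative task pointers): keep a single grav_counts
 * send/recv pair and let every channel that can create new gparts unlock
 * the same send task, instead of one pair per formation channel. */
scheduler_addunlock(s, t_star_formation,      t_send_grav_counts);
scheduler_addunlock(s, t_star_formation_sink, t_send_grav_counts);
scheduler_addunlock(s, t_sink_formation,      t_send_grav_counts);

/* On the foreign side, the single recv_grav_counts unlocks the gravity
 * tasks as before, independently of which channel made the new gparts. */
```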
added 11 commits
- b0f26404...8c6c3eae - 10 commits from branch master
- 52260b17 - Merge branch 'master' into darwin/sink_mpi_physics
Update: I'm trying to run the level 8 (~16 million particles) HomogeneousBox on our cluster. I encountered the following problem with the initial partitioning at the beginning of the simulation:

```
Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
Input Error: Incorrect tpwgts for partition 3 and constraint 0.
Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
Input Error: Incorrect sum of 0.000000 for tpwgts for constraint 0.
[bonsai02:922573:0:922573] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xec623bf4)
[bonsai02:922574:0:922574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xbab3ebc8)
==== backtrace (tid: 922574) ====
 0 0x0000000000054db0 __GI___sigaction()  :0
 1 0x0000000000045d1d libparmetis__PartitionSmallGraph()  ???:0
 2 0x000000000002390e ParMETIS_V3_PartKway()  ???:0
 3 0x00000000004e242d pick_parmetis()  /home/astro/roduit/program/swiftsim_2_debug/src/partition.c:1060
 4 0x00000000004eac24 partition_initial_partition()  /home/astro/roduit/program/swiftsim_2_debug/src/partition.c:2001
 5 0x00000000004537aa engine_split()  /home/astro/roduit/program/swiftsim_2_debug/src/engine.c:3104
 6 0x000000000040bc58 main()  /home/astro/roduit/program/swiftsim_2_debug/swift.c:1622
 7 0x000000000003feb0 __libc_start_call_main()  ???:0
 8 0x000000000003ff60 __libc_start_main_alias_2()  :0
 9 0x000000000040e645 _start()  ???:0
=================================
```
Also note that in src/partition.c, we need to add t->type == task_type_star_formation_sink || to the task weights in check_weights() and partition_gather_weights().
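For the record, here is a hedged sketch of what that change amounts to; the actual condition in src/partition.c lists many more task types, and weights_v, cid and w stand in for the local variables used there:

```c
/* Sketch of the intended addition in partition_gather_weights() and
 * check_weights(): count the sink-driven star formation task in the
 * per-cell weights, next to the existing star_formation entry. */
if (t->type == task_type_star_formation ||
    t->type == task_type_star_formation_sink /* <-- line to add */) {
  weights_v[cid] += w; /* accumulate this task's cost on its top-level cell */
}
```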
added 7 commits
- 52260b17...6abdf531 - 6 commits from branch master
- 6190f3bb - Merge with master
Update + notes for me: I got the Homogeneous box example running on our cluster.
Particle shift diff
I encountered:
```
[0002] [05966.4] runner_doiact_functions_stars.h:runner_dopair_branch_stars_feedback():1293: particle shift diff exceeds dx_max_sort in cell cj. cj->nodeID=7 ci->nodeID=2 d=6.515264e+00 sort_j[pjd].d=0.000000e+00 cj->stars.dx_max_sort=0.000000e+00 cj->stars.dx_max_sort_old=0.000000e+00, cellID=162922513 super->cellID=3538961 cj->depth=4 cj->maxdepth=7
```
The corresponding checks for hydro + limiter/synchronisation are already commented out, given that they often trigger with sinks. I need to comment out this check, too.
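A minimal sketch of what "commenting this check" could look like, assuming a compile-time way to detect a sink run (the SINK_NONE guard is an assumption; the actual fix may simply comment the error() call out):

```c
/* Sketch: downgrade the dx_max_sort consistency check from a fatal error
 * to a message when a sink model is compiled in, mirroring the treatment
 * of the hydro + limiter/synchronisation checks. */
#ifndef SINK_NONE
      message("particle shift diff exceeds dx_max_sort in cell cj (check relaxed with sinks).");
#else
      error("particle shift diff exceeds dx_max_sort in cell cj.");
#endif
```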
Gravity
More related to the sinks, after 1h30-2h of running I got:
```
[0006] [06705.9] runner_recv.c:runner_do_recv_gpart():198: Received a cell at an incorrect time c->ti_end_min=0, e->ti_current=4325478743670784.
```
and
```
[0000] [05899.3] runner_doiact_grav.c:runner_dopair_grav_pp():1234: Un-drifted gparts
```
I tried:
- with GEAR SF and this branch: same errors
- with GEAR SF and master (yesterday morning's version): no error
Next steps
- Print the cell IDs to track the cell's task dependencies
- Compare this branch with master to find out the problem
- Improve the check in recv for sf_sink (I left it as a TODO...)
added 15 commits
- 6190f3bb...81ab35eb - 13 commits from branch master
- ba41f28e - Merge with master
- 65e3af56 - Merge with master
added 1 commit
- 8148a2ca - Convert r_cut to h in proxy.c and runner_recv.c
added 5 commits
added 6 commits
- cd840aee...c184f132 - 5 commits from branch master
- 227568fe - Merge branch 'master' into darwin/sink_mpi_physics
added 19 commits
- 22f98136...b585e17e - 16 commits from branch master
- 5d6a6b86 - Merge branch 'master' into darwin/sink_mpi_physics
- 1f7c7264 - Rework cell_unskip.c to unskip sink tasks on foreign nodes to not miss any dependencies
- 3ea99e69 - Add sink_ghost2 dependcy on the foreign cell to not miss any dependencies
Hi everyone, I'm sorry for the late update. I was busy with my PhD, but this gave me time to reflect on the causes of various problems.
I turned off SF to ensure that the other dependencies were okay. It turned out that we still have problems with "undrifted part" and also with "sink was not swallowed". I inspected the cell dependencies, and they are sometimes broken. For instance, do_gas_sink_swallow is not linked to sink_swallow. This often happens because do_gas_swallow or sink_ghost2 were not activated. This convinced me that the task dependencies are fine in engine_maketasks.c; this is probably a task activation problem.

Since the task activation was mainly based on the BHs, and we do not create BHs on the fly, I had a look at the stars (remember that sinks are a mix of BHs and stars). Currently, I'm checking the task activation for foreign cells to ensure they do not have broken dependencies because some tasks were not activated.
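To make the suspicion concrete, the kind of activation I'm checking looks roughly like the sketch below; the field names (sinks.sink_ghost1, ...) and the exact place in cell_unskip_sinks_tasks() are assumptions used only to illustrate the idea:

```c
/* Sketch: when a sink pair task involving a foreign cell is activated,
 * the sink ghosts on that cell must be activated too, otherwise
 * do_gas_swallow / do_sink_swallow lose their unlocks and the chain
 * sink_swallow -> do_gas_swallow -> ... is broken. */
if (cj->nodeID != e->nodeID) {
  if (cj->sinks.sink_ghost1 != NULL) scheduler_activate(s, cj->sinks.sink_ghost1);
  if (cj->sinks.sink_ghost2 != NULL) scheduler_activate(s, cj->sinks.sink_ghost2);
}
```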
added 1 commit
- 6c5a24f1 - Add dependencies send_sink_rho --> send_sink_gas_swallow --> send_sink_merger...
I've just added dependencies between the sink send tasks to send things in an ordered manner when the interaction tasks are not created. This is similar to the gas send tasks.
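In code terms, the change is essentially an ordering chain between the send tasks, along these lines (the task pointer names are illustrative; the real ones follow the send_sink_rho --> send_sink_gas_swallow --> send_sink_merger commit above):

```c
/* Sketch: chain the sink send tasks of a cell so the MPI messages go out
 * in a fixed order (rho -> gas_swallow -> merger), mirroring what is
 * already done for the gas send tasks. */
scheduler_addunlock(s, t_send_sink_rho, t_send_sink_gas_swallow);
scheduler_addunlock(s, t_send_sink_gas_swallow, t_send_sink_merger);
```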
These changes + the previous ones in cell_unskip.c made the code run for longer, ~12 hours instead of 3. That's an improvement!

Notice that I disabled SF for now, since it does not even work for SF runs (i.e. with --star-formation).

A few notes for me later:
- I need to print multiple cell dependencies at once so that I can see the top cells and the grav and hydro supers at the same time.
- Can I write the dependencies when an error is triggered? Probably yes, by modifying scheduler_write_cell_dependencies() to take a cellID and calling it before throwing an error. This would greatly help in finding the dependency/task activation errors. [16.04.2025: Answer: Yes, this can be done and works extremely well!]
Here are three (working) dependency graphs for cell 4063249 (hydro super): ranks 0 and 3 at step 592, and rank 0 at step 594.
added 23 commits
- 6c5a24f1...524217c9 - 22 commits from branch master
- 4fa32144 - Merge branch 'master' into darwin/sink_mpi_physics
This is an update with good news. Thanks to point 2 in my previous post, I found the dependency/task activation error that generated my "un-drifted sink" in sink_swallow.
I added a scheduler_write_cell_dependencies_debug() (to be committed) that writes the dependencies for the current cell, the hydro super, the grav super, and the top cells. Then, I call it before throwing an error. This is much more efficient than choosing a cell beforehand and praying to the programming gods that this cell will have problems.

The following dependency graph shows that this cell is missing a sink_density task. I don't know if this cell is local or foreign, but we are missing sink_in, sink_density, sink_density_ghost, and sink_density_send/recv. Notice that the stars have their stars_in and density tasks.

The next step is to have a look at the sink density task activation, since it is responsible for activating the missing tasks I mentioned.
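The calling pattern is simply to dump the graphs right before aborting, roughly as in the sketch below; scheduler_write_cell_dependencies_debug() is the not-yet-committed helper mentioned above, so its signature is an assumption, and problem_detected stands for whatever condition was about to call error():

```c
/* Sketch (assumed helper signature): write the dependency graphs of the
 * offending cell, its hydro/grav supers and its top-level cell, then
 * abort as before so the broken activation can be inspected offline. */
if (problem_detected) {
  scheduler_write_cell_dependencies_debug(&e->sched, c);
  error("Broken sink task activation for cell %lld", c->cellID);
}
```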
Hey everyone! Here is an update.
I finally managed to restart with MPI. (I had a mismatch in some memory data structures for unknown reasons...) Now I only have a ~5 minute waiting time before I can debug, so debugging will be much faster.
Regarding my previous post: I read the wrong line; the density tasks are created. However, they are flagged as skipped...
But, good news! The bug is reproducible and happens at the same timestep.
However, the cells can be different. At this timestep, I believe that a set of cells have their density tasks flagged as skipped, and at each restart, I find a different subset of these cells.
Next steps: understand why/how/where these tasks are flagged as skipped.
As a note, I will be on holiday (without a laptop for once ;) from the 8th of May to the 17th. With a clear mind, debugging will go better!