# SWIFTsim merge requests
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests

## !1353 Draft: Subtask speedup - Still requires work (Matthieu Schaller, 2022-08-06)
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1353

For now, compile with `CFLAGS=-DONLY_SUBTASKS` added to the configuration line.

This introduces significantly faster neighbour finding in particle distributions with strong density gradients, for instance planetary applications or problematic feedback-disturbed galaxies.
Changes:
- Add the brute-force density checks to the planetary scheme (not just SPHENIX).
- Rewrite the recursion logic in the hydro and stars sub-tasks:
  - The interaction functions take extra parameters to optionally consider only particles with h between 0.5 * width and width.
  - The sub-task recursion now continues to lower levels even after reaching a level where h is too large. From that level down, we just use the feature of only considering particles in the appropriate range of h.
  - When recursing, only the h_max of _active_ particles is considered, not that of all particles.
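The recursion rule above can be sketched as follows. This is an illustrative sketch only: the struct layout and function names (`cell`, `dosub_self`, `doself_range`) are hypothetical stand-ins for the SWIFT internals, and the interaction routine is a stub that merely counts calls. The point is the shape of the recursion: descend all the way to the leaves, and let each intermediate level mop up only the particles whose h falls in its own (0.5 * width, width] band.

```c
#include <stddef.h>

/* Hypothetical cell: only the fields needed to show the recursion. */
struct cell {
  double width;            /* cell edge length */
  struct cell *progeny[8]; /* NULL when not present */
  int split;
};

static int n_range_calls = 0; /* interactions restricted to one h band */
static int n_full_calls = 0;  /* unrestricted leaf-level interactions */

/* Stand-in for the pairwise interaction loop. A real version would, when
 * restrict_range is set, only consider particles whose smoothing length
 * satisfies 0.5 * width < h <= width. */
static void doself_range(struct cell *c, int restrict_range) {
  (void)c;
  if (restrict_range)
    n_range_calls++;
  else
    n_full_calls++;
}

/* Keep recursing even when some h values are too large for the level;
 * those particles are caught by the banded interaction on the way up. */
static void dosub_self(struct cell *c) {
  if (c->split) {
    for (int k = 0; k < 8; k++)
      if (c->progeny[k] != NULL) dosub_self(c->progeny[k]);
    /* Particles whose h is too large for the progeny are handled here. */
    doself_range(c, /*restrict_range=*/1);
  } else {
    doself_range(c, /*restrict_range=*/0);
  }
}
```

On a split cell with two leaf children this produces two unrestricted leaf interactions plus one banded interaction at the parent level.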
Notes:
- We will consider c7adb289391c613ccebf2be0bd630b83f86fe171 separately.
- In a second phase, I'll remove the self + pair tasks entirely and keep only the subs.
Todo:
- [x] Verify RT isn't broken
- [x] Verify MPI runs are happy.
- [x] Port changes to the other hydro schemes.
- [ ] Handle particles drifting out of their cells.
Fixes #688.

## !1393 Slimming down of foreign gpart + reduced comm size (Matthieu Schaller, 2024-01-10)
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1393

Here are the changes to improve the foreign memory usage of `gpart`s and help with the related communications.

This includes:
- Creation of two new particle types: one for the regular foreign `gpart` and one for the `gpart` when running FOF.
- Modification of the gravity cache construction to use the foreign particle type when acting on a foreign cell.
- Usage of the new packing task to fish out only the fields we want (nothing needs to be unpacked, however).
- Modification of the `gpart` and `fof` communications.
- Modification of the construction of the foreign buffers to accommodate both types.
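The packing idea can be sketched like this. The field choices and all names (`gpart`, `gpart_foreign`, `gravity_pack`) are illustrative assumptions, not the actual SWIFT structs: the point is that the foreign type carries only what the gravity interaction reads, so the communicated buffer is smaller and can be used directly on the receiving side without an unpacking step.

```c
#include <string.h>

/* Hypothetical full particle: the foreign side never needs most of it. */
struct gpart {
  double x[3];
  float mass;
  float h_max;      /* softening-related field, illustrative */
  long long id;
  float a_grav[3];  /* accelerations: local-only, never sent */
  /* ... many more fields in a real code ... */
};

/* Slimmed-down foreign type: just what the gravity cache needs. */
struct gpart_foreign {
  double x[3];
  float mass;
  float h_max;
};

/* Packing task: fish out only the wanted fields, by hand. The receiver
 * uses the gpart_foreign buffer directly; no unpack step is required. */
static void gravity_pack(const struct gpart *restrict src,
                         struct gpart_foreign *restrict dst, size_t count) {
  for (size_t i = 0; i < count; i++) {
    memcpy(dst[i].x, src[i].x, sizeof(dst[i].x));
    dst[i].mass = src[i].mass;
    dst[i].h_max = src[i].h_max;
  }
}
```

A second, analogous foreign type would carry the extra fields (e.g. the particle ID) that FOF needs.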
All the physics changes are the same as what was tried in !1318. The difference is that the packing is done here by hand and that the case of FOF is correctly handled.

## !1654 Draft: NUMA aware pinning of queues and runners (Peter W. Draper, 2023-09-11)
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1654

Adds the capability to pin queues and runners within NUMA regions.
Adds queue selection by tasks on the basis of the NUMA region that holds the start of the task's main data area.
Adds the spreading of SWIFT-allocated memory using larger interleave chunks (the default for those is 4k).
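The queue-selection rule can be sketched as below. Everything here is a hypothetical stand-in: in real code the NUMA region holding an address would come from the OS (e.g. `move_pages()` or libnuma), whereas this self-contained mock assumes a fixed 4k-chunk interleave across regions.

```c
#include <stdint.h>

/* Assumed region count; COSMA8 nodes have 8 NUMA regions. */
#define NR_REGIONS 8

/* Mock region lookup: pretend memory is interleaved across the regions
 * in 4k chunks, so the region follows from the page number. A real
 * implementation would query the kernel for the page's actual node. */
static int region_of(const void *ptr) {
  return (int)(((uintptr_t)ptr >> 12) % NR_REGIONS);
}

/* With queues and runners pinned to regions in pairs, a task is enqueued
 * on the queue matching the region that holds the start of its main data
 * area, so the runner that pops it is local to that memory. */
static int queue_for_task(const void *task_data, int nr_queues) {
  return region_of(task_data) % nr_queues;
}
```

The one-to-one queue/runner pairing mentioned below is what makes this mapping meaningful: each queue is drained by a runner pinned to the same region.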
On COSMA8 this shows speed improvements over the existing master, even with pinning and interleave.
(Based on !1649, now merged, so we also have those improvements.)
Not sure how serious these changes are yet: we need to add an additional argument to all the swift_free() calls so that the memory spread can be undone, and the memory alignment is done using page boundaries (4k). It also requires a one-to-one correspondence between queues and runners, as these are pinned to NUMA regions in pairs.

## !1661 Draft: Implement a memcpy() clone that uses the threadpool (Peter W. Draper, 2023-09-11)
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1661

Using multiple threads to copy data can make better use of the available memory bandwidth when multiple NUMA regions are being used. This is an attempt to exploit that.
Experiments show that it should only be used for quite large copies, so a heuristic is in place: the copy must span a minimum number of 4k pages. At least one page seems to be required, which makes sense, but in reality more are needed (COSMA8, AMD 128 core, 8 NUMA regions). Surprisingly, even then using all the threads does not guarantee the best speed, so we also limit the number of threads used to 25% of the total; using more than one gives a good improvement which tails off rapidly.
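A minimal sketch of such a threaded memcpy, assuming hypothetical names and thresholds (`threadpool_memcpy`, a 4-page minimum, a 25% thread cap); the real implementation would reuse the existing threadpool rather than spawning threads per call:

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MIN_PAGES 4   /* below this, threading costs more than it saves */
#define MAX_THREADS 64

struct copy_chunk {
  void *dst;
  const void *src;
  size_t n;
};

static void *copy_thread(void *arg) {
  struct copy_chunk *c = arg;
  memcpy(c->dst, c->src, c->n);
  return NULL;
}

/* Copy n bytes using a fraction of the pool; small copies fall back to
 * plain memcpy per the page-count heuristic described above. */
static void threadpool_memcpy(void *dst, const void *src, size_t n,
                              int pool_size) {
  int nr_threads = pool_size / 4; /* 25% cap: more gives little benefit */
  if (nr_threads < 1) nr_threads = 1;
  if (nr_threads > MAX_THREADS) nr_threads = MAX_THREADS;
  if (n < (size_t)MIN_PAGES * PAGE_SIZE || nr_threads == 1) {
    memcpy(dst, src, n);
    return;
  }
  pthread_t threads[MAX_THREADS];
  struct copy_chunk chunks[MAX_THREADS];
  const size_t chunk = (n + nr_threads - 1) / nr_threads;
  int used = 0;
  for (size_t off = 0; off < n; off += chunk, used++) {
    const size_t len = (off + chunk > n) ? n - off : chunk;
    chunks[used] = (struct copy_chunk){(char *)dst + off,
                                       (const char *)src + off, len};
    pthread_create(&threads[used], NULL, copy_thread, &chunks[used]);
  }
  for (int i = 0; i < used; i++) pthread_join(threads[i], NULL);
}
```

Each thread copies a contiguous slice, so on a multi-region machine the slices can stream through different memory controllers in parallel.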
The actual gains here are modest; the largest come from particle splitting, but only when the buffers need reallocating, and from replication, which is only used for testing.
So the question is, is this worth it...

## !1665 Draft: Add threadpool mappers for scheduler_reweight (Peter W. Draper, 2023-09-11)
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1665

This is better than the serial code (100 ms compared to 500 ms in my test), but has one issue: some weights will be less optimal than before, as the accumulation of weights will not happen between threads. This is why we use a uniform chunk size, to keep the workload from too many splits.

For any reasonable set of tasks this is a small issue; there is no measurable effect that I can see.
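The uniform-chunk mapper idea can be sketched as follows; all names (`task`, `reweight_mapper`, `scheduler_reweight_parallel`) and the weight model are hypothetical stand-ins. Each thread only ever touches its own contiguous chunk, which is exactly why weight accumulation across chunk boundaries is lost relative to the serial loop.

```c
#include <pthread.h>

struct task {
  int cost;
  float weight;
};

struct map_job {
  struct task *tasks;
  int count;
};

/* The mapper applied to one uniform chunk of the task array; the real
 * mapper would compute the scheduler weights from task type and cost. */
static void reweight_mapper(struct task *tasks, int count) {
  for (int i = 0; i < count; i++)
    tasks[i].weight = 2.0f * (float)tasks[i].cost; /* stand-in model */
}

static void *map_thread(void *arg) {
  struct map_job *job = arg;
  reweight_mapper(job->tasks, job->count);
  return NULL;
}

/* Split nr_tasks into uniform chunks and run the mapper in parallel.
 * The fixed chunk size bounds the number of splits (and thus the
 * dispatch overhead), at the price of no cross-chunk accumulation. */
static void scheduler_reweight_parallel(struct task *tasks, int nr_tasks,
                                        int chunk) {
  enum { MAX_JOBS = 64 };
  pthread_t threads[MAX_JOBS];
  struct map_job jobs[MAX_JOBS];
  int nr_jobs = 0;
  for (int start = 0; start < nr_tasks && nr_jobs < MAX_JOBS;
       start += chunk, nr_jobs++) {
    const int count = (start + chunk > nr_tasks) ? nr_tasks - start : chunk;
    jobs[nr_jobs] = (struct map_job){tasks + start, count};
    pthread_create(&threads[nr_jobs], NULL, map_thread, &jobs[nr_jobs]);
  }
  for (int i = 0; i < nr_jobs; i++) pthread_join(threads[i], NULL);
}
```

In SWIFT this would go through threadpool_map rather than raw pthreads; the chunking behaviour is the same.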