SWIFTsim merge requests

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests 2023-11-30T15:24:19Z

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1823 Support optimization for the AMD aocc compiler 2023-11-30T15:24:19Z Peter W. Draper

Support optimization for the AMD aocc compiler

Make sure we use the optimized maths library and interprocedural optimization. Both are needed to get optimization that works as well as with other clang-based compilers. Note that in this MR we enable interprocedural optimization by default, unless the `--enable-debug` or `--disable-optimization` flags are used.

compilation enhancement feature request performance vectorization Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1804 Fix AX_OPENMP performance regression 2023-11-20T10:09:41Z Peter W. Draper

Fix AX_OPENMP performance regression

When moving from AC_OPENMP to AX_OPENMP we stopped the AC_SUBST of OPENMP_CFLAGS, so these flags haven't been used since. Fix that, but only use OPENMP_CFLAGS where needed: as compiler hints for vectorizing some loops in the gravity interactions, and only require the OpenMP runtime when linking against an OpenMP FFTW library. Fixes #865

architecture bug cleanup code health compilation Configuration enhancement performance vectorization Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1737 Add cpuid for AMD Milan and Genoa CPUs. 2023-07-06T17:11:31Z Peter W. Draper

Add cpuid for AMD Milan and Genoa CPUs.

Add cpuid patterns for AMD Milan and Genoa chips. Without these, the znver flags will not be set correctly. Note that GCC 13 is needed for Genoa znver4 support.

architecture compilation Configuration enhancement performance vectorization Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1686 Keep affinity of threadpool threads the same as the main thread on entry 2023-01-30T13:50:10Z Peter W. Draper

Keep affinity of threadpool threads the same as the main thread on entry

Keep affinity of threadpool threads the same as the main thread on entry, regardless of later changes to the main thread. This makes sure that we don't pin the threadpool threads to one CPU when the `engine_pin()` function is in effect. Fixes #846

architecture bug enhancement performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1649 Implement lock free subcell splitting and other speed ups. 2022-12-22T16:04:59Z Peter W. Draper

Implement lock free subcell splitting and other speed ups.

Based on work in the zoom-master branch by Matthieu and Will.

Avoids locking the memory used for subcells during splitting by having a pool of memory for each threadpool thread. In simple tests this speeds things up nicely, especially during step 0.

Also speeds up the engine by assigning a runner to each cell (the owner) more uniformly and randomly, and by using more information about the weights when scheduling tasks.

An EAGLE_50 volume run on a single COSMA 8 node shows speed ups of the order of 20% over the initial 128 steps. This is also faster than 8xMPI on a node, but the reasons for that are more nuanced and the gain may be smaller for a proper MPI run (since for instance the MPI limits for fastest possible step and communications within a step will return).

enhancement performance Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1509 Parallel mesh assignment also in single-node case 2022-04-22T12:59:47Z Matthieu Schaller
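The per-thread memory pool described in "Implement lock free subcell splitting and other speed ups" can be illustrated with a rough Python sketch. This is not SWIFT's C implementation: the class name `SubcellPools`, the `chunk` parameter, and the use of plain dictionaries as stand-ins for cells are all assumptions made for the example. The point it shows is only that each thread draws from its own free list, so handing out a sub-cell needs no lock.

```python
import threading


class SubcellPools:
    """One private free list of sub-cells per thread (hypothetical sketch)."""

    def __init__(self, chunk=64):
        self.chunk = chunk
        # threading.local gives every thread its own `pool` attribute.
        self.local = threading.local()

    def get(self):
        """Hand out one sub-cell from the calling thread's private pool."""
        pool = getattr(self.local, "pool", None)
        if not pool:
            # Pool missing or exhausted: refill it with a fresh chunk.
            # Only the calling thread ever touches this list.
            pool = self.local.pool = [dict() for _ in range(self.chunk)]
        return pool.pop()
```

In a C code the occasional refill would still have to obtain memory from a shared allocator, but that happens once per chunk rather than once per sub-cell.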

Parallel mesh assignment also in single-node case

Change the strategy used when interpolating the `gpart` onto the gravity mesh. Each thread (aka. top-level cell) constructs a local patch of the mesh and assigns its particles to it. When done, the patch is written to the global mesh using atomics. This is now similar to what is done in the distributed MPI case.

This strategy seems to be beneficial when there are lots of particles in only a few cells, for instance in a zoom run. It seems to not be slower in other cases either.

The behaviour can be controlled by a runtime parameter (`Gravity:mesh_uses_local_patches` defaults to `1`) if one wants to roll back to the "old" per-particle atomic assignment.

performance SPH Bert Vandenbroucke Bert Vandenbroucke

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1458 Refactoring of the time-step communication tasks 2021-12-16T13:44:02Z Matthieu Schaller
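The local-patch strategy of "Parallel mesh assignment also in single-node case" can be sketched as follows. This is a simplified illustration under stated assumptions, not the SWIFT code: it uses a periodic 1D mesh, nearest-grid-point assignment instead of the real interpolation, and a lock as a stand-in for the atomic adds. The function name and its arguments are hypothetical.

```python
import threading


def assign_with_local_patches(cells, mesh_size, box_size):
    """Deposit particle masses on a periodic 1D mesh, one patch per cell.

    `cells` is a list of lists of (position, mass) tuples.
    """
    mesh = [0.0] * mesh_size
    lock = threading.Lock()  # stand-in for the atomics used in the MR

    def work(particles):
        if not particles:
            return
        # Mesh index of every particle in this cell.
        idx = [int(x / box_size * mesh_size) % mesh_size for x, _ in particles]
        lo, hi = min(idx), max(idx)
        patch = [0.0] * (hi - lo + 1)
        # Accumulate into the private patch: no synchronisation needed.
        for i, (_, m) in zip(idx, particles):
            patch[i - lo] += m
        # Write the finished patch to the shared mesh in one pass.
        with lock:
            for k, value in enumerate(patch):
                mesh[lo + k] += value

    threads = [threading.Thread(target=work, args=(c,)) for c in cells]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mesh
```

The shared mesh is touched once per patch element rather than once per particle, which is where the benefit for heavily loaded cells would come from.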

Refactoring of the time-step communication tasks

Significant re-factoring of the way the time-step sizes are being exchanged. Same as !1455 but without the last batch of changes.

Summary:

* A new top-level task collects the time-step sizes from the super level to the top-level. This was formerly done by making `engine_collect_end_of_step()` recurse.
* The `timestep`, `timestep_limiter`, and `timestep_sync` tasks all unlock that top-level task.
* `engine_collect_end_of_step()` now only loops (via threadpool) over the local top-level cells. No recursion any more.
* For each pair of top-level cells in the proxies we construct a pair of send/recv comm tasks.
* That comm task packs up the dt of the whole hierarchy, sends it, and unpacks the time-step sizes.
* The top-level time-step collection task unlocks the send.
* The individual per-species `tend` communication tasks that used to live at the super level are removed.
* The second call to `engine_launch()` done every step to deal with the timestep limiter effect is removed (as it is now properly dealt with by the top-level task dependency).

This should help speed up the smallest steps by reducing the level of the plateau we usually see in the "main sequence" plots.

performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1360 Add NUMA interleave of memory allocations 2021-11-04T18:16:46Z Peter W. Draper
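The collect, pack, and unpack steps listed for "Refactoring of the time-step communication tasks" can be sketched in a few lines. This is only an illustration: the `Cell` class, the single `ti_end_min` field per cell, and the depth-first buffer layout are assumptions, and the collection here runs from the leaves rather than from the super level as in the MR.

```python
class Cell:
    """Hypothetical cell carrying one time-step quantity."""

    def __init__(self, ti_end_min=0, progeny=None):
        self.ti_end_min = ti_end_min
        self.progeny = progeny or []


def collect(cell):
    """Propagate the minimal end time from the leaves up to this cell."""
    if cell.progeny:
        cell.ti_end_min = min(collect(c) for c in cell.progeny)
    return cell.ti_end_min


def pack(cell, buffer):
    """Append the time-step data of the whole hierarchy, depth first."""
    buffer.append(cell.ti_end_min)
    for c in cell.progeny:
        pack(c, buffer)
    return buffer


def unpack(cell, buffer, pos=0):
    """Fill a structurally identical foreign hierarchy; return next index."""
    cell.ti_end_min = buffer[pos]
    pos += 1
    for c in cell.progeny:
        pos = unpack(c, buffer, pos)
    return pos
```

One buffer per pair of top-level cells replaces the separate per-species messages, which is the saving the summary describes.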

Add NUMA interleave of memory allocations

Adds the option to interleave memory allocations uniformly across the NUMA regions which are allowed by the CPU affinity mask. Seems to help the threadpool when running EAGLE_50 on a single node of COSMA8, see #760.

engineering enhancement memory usage performance Bert Vandenbroucke Bert Vandenbroucke

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1419 In the EAGLE model, no feedback from 0-age stars 2021-10-18T09:18:04Z Matthieu Schaller

In the EAGLE model, no feedback from 0-age stars

Backport function from the 'stu' branch of the FLAMINGO-fork.

- Exploit the idea that 0-age stars don't actually do any feedback so we can skip the activation of star-feedback tasks if there are no stars.
- Simplify the feedback-task activation mechanism by unifying things into a single call.
- Add the scheme-level option to not run tasks if there are no stars at all.

EAGLE enhancement performance SPH Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1341 Use only min mass gas in time step 2021-10-15T10:58:12Z Loic Hausammann

Use only min mass gas in time step

Fix #758.

My student finished their simulations and everything seems alright. Do you wish to keep the current behavior for your simulations, or are you fine if I merge it like this?

GEAR performance SPH Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1408 MPI parallel mesh gravity - hashmap free 2021-08-19T19:56:04Z Matthieu Schaller

MPI parallel mesh gravity - hashmap free

Implements the mesh gravity calculation in an MPI-distributed fashion. This allows running with much larger meshes and reduces the memory footprint.

Base implementation is similar to !1045.

* Each rank accumulates its own contributions to the density field. These would need to be stored in some kind of sparse array to avoid making any assumptions about how the domain decomposition is done. Unlike !1045, we do not use a hashmap but rather a simple array since we can actually predict what we need and construct nice keys.
* Each rank allocates a slab of the full mesh as required by FFTW.
* The mesh contributions are sent to whichever rank stores the corresponding slab. This is reasonably straightforward because the coordinates of a cell indicate which rank it's stored on.
* The FFT/Green function/CIC deconvolution/inverse FFT are carried out to make a slab-distributed mesh containing the potential.
* Each rank calculates which cells it needs to calculate the potential gradient for its particles and requests them from whichever node they're on. This is slightly trickier than constructing the mesh because we need several cells around each particle to evaluate the gradient.
* We can then evaluate the potential and acceleration on the particles as usual.

Another difference with !1045 is that I am using a hand-written bucket sort instead of the `qsort()` that was used originally. Since we only need a crude sort with a fixed small number of bins this is much, much faster. (But still the current bottleneck.)

The code needs to be configured with `--enable-mpi-mesh-gravity` and the runtime parameter `Gravity:distributed_mesh` must be set to 1.

Implements #524. Bypasses #716. Likely supersedes !1045.

Possible improvements:

- [x] Use the threadpool to speed up the three bucket sorts. At least to construct the counts.
- [x] Construct the array of send/recv counts at the same time as the bucket sort. Or just from the bucket counts?
- [x] Check whether the potential patches can be smaller than currently and only cover the particle extent.

enhancement memory usage MPI performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1385 Pack the timebin for the limiter communications 2021-06-24T12:24:32Z Matthieu Schaller
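The bucket sort mentioned in "MPI parallel mesh gravity - hashmap free" can be illustrated with a short counting-style sort keyed on the rank that owns each mesh cell. This is a sketch only: the function name, the tuple layout, and the slab rule (equal contiguous slabs along the first index, remainder going to the last rank) are assumptions, since FFTW decides the real slab layout.

```python
def bucket_sort_by_rank(cells, mesh_size, n_ranks):
    """Order mesh cells by owning rank and return per-rank counts.

    `cells` is a list of (i, j, k, value) tuples.
    """
    slab = max(1, mesh_size // n_ranks)

    def owner(cell):
        # Assumed layout: contiguous slabs along the first mesh index.
        return min(cell[0] // slab, n_ranks - 1)

    # Pass 1: count the entries falling into each bucket.
    counts = [0] * n_ranks
    for c in cells:
        counts[owner(c)] += 1

    # Exclusive prefix sum gives the first output slot of each bucket.
    offsets = [0] * n_ranks
    for r in range(1, n_ranks):
        offsets[r] = offsets[r - 1] + counts[r - 1]

    # Pass 2: scatter each entry into its bucket (stable within a bucket).
    out = [None] * len(cells)
    cursor = list(offsets)
    for c in cells:
        r = owner(c)
        out[cursor[r]] = c
        cursor[r] += 1
    return out, counts
```

With a fixed, small number of bins this takes two linear passes instead of a comparison sort, and the `counts` array is the kind of information the second "possible improvement" suggests reusing for the send/recv counts.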

Pack the timebin for the limiter communications

For the `part` communication related to the time-step limiter, only send the `time-bin` (an `int8_t`) rather than the whole particle. The packing is done manually rather than by using an MPI-provided mechanism.

performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1356 Add profiling for CSDS 2021-05-12T08:22:10Z Loic Hausammann
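The manual packing described in "Pack the timebin for the limiter communications" amounts to copying one signed byte per particle into a contiguous buffer. A minimal Python sketch of that idea, with hypothetical function names and `struct` standing in for the hand-written C copy loop:

```python
import struct


def pack_timebins(time_bins):
    """Pack one signed byte (like an int8_t) per particle into a buffer."""
    return struct.pack(f"{len(time_bins)}b", *time_bins)


def unpack_timebins(buffer):
    """Recover the list of time-bins from a received buffer."""
    return list(struct.unpack(f"{len(buffer)}b", buffer))
```

The buffer holds exactly one byte per particle, so the message size no longer depends on the size of the full particle structure.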

Add profiling for CSDS

Nothing fancy here, I am just adding a task category for the CSDS and adding some messages in order to track the time spent in the CSDS with `analyze_runtime.py`.

enhancement performance Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1352 CSDS Move the index files into the reader. 2021-05-11T13:52:50Z Loic Hausammann

CSDS Move the index files into the reader.

Keeping in memory the information required for the index files during the simulation costs far too much in terms of memory and also performance. To reduce this, I have moved all the logic into the reader, where we will not have as much information to track.

Some other changes that are worth mentioning:

- In the CSDS yaml file, I am writing the initial number of particles. This value is used to initialize the arrays when generating the index files. The arrays can still grow if needed.
- When all the particles are written just before the first step, they are now flagged as being created.
- The data within the special flag now contains the particle type. As the type was previously found from the index file, we now need a way to get it directly from the logfile. We still use 4 bytes, but now we have 1 byte for the particle type, 2 bytes for any information (e.g. MPI rank for particles leaving/entering a rank) and 1 byte for the type of event (e.g. particle leaving/entering a rank, star formation, deletion, creation, ...).
- Initialize the time step counter to 0. There is no need to write the particles quickly at the start of the simulation as we are already manually writing them.

Continuous Simulation Data Stream i/o memory usage performance Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1317 Do not collect and store ti_end_max since we never make use of it for anything. 2021-03-28T21:55:46Z Matthieu Schaller
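The 4-byte special flag described in "CSDS Move the index files into the reader" (1 byte particle type, 2 bytes of information, 1 byte event type) can be sketched with Python's `struct` module. The field order, the little-endian byte order, the event codes, and the function names below are assumptions made for the illustration; the real layout is defined in the CSDS sources.

```python
import struct

# Hypothetical event codes, for illustration only.
EVENT_CREATE, EVENT_DELETE, EVENT_MPI_EXIT = 1, 2, 3

# "<" disables padding: 1 byte + 2 bytes + 1 byte = 4 bytes in total.
FLAG_FORMAT = "<BHB"


def pack_flag(part_type, info, event):
    """Encode particle type, extra information, and event type in 4 bytes."""
    return struct.pack(FLAG_FORMAT, part_type, info, event)


def unpack_flag(raw):
    """Decode a 4-byte flag into (part_type, info, event)."""
    return struct.unpack(FLAG_FORMAT, raw)
```

For example, `pack_flag(4, 513, EVENT_MPI_EXIT)` would record a particle of type 4 leaving towards rank 513, and a reader can recover the type from the flag alone without consulting an index file.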

Do not collect and store ti_end_max since we never make use of it for anything.

We collect and store ti_end_max for each particle type but never make use of it. We used to, but that is not the case anymore as there is no real gain to be had anywhere.

Do you agree that it's sensible to remove it and hence shave some memory off the cells and the pcell communications?

memory usage performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1302 Logger multithreading 2021-03-15T08:51:30Z Loic Hausammann

Logger multithreading

Still waiting on !1294 in order to add the number of threads in the parameter structure.

Nothing fancy here, I am just rewriting the `read_all` functions in order to use a threadpool. While the code is a bit slower now on 1 thread, I see a speedup of 2.3 when using 4 threads compared to the previous version. The speedup obtained is for the `SedovBlast3D`, therefore we are still dealing with a relatively small example where setting up the threadpool takes a non-negligible amount of time.

Continuous Simulation Data Stream i/o performance Pedro Gonnet Pedro Gonnet

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1278 Add script showing the scaling of all the timed functions 2021-01-30T11:09:12Z Loic Hausammann

Add script showing the scaling of all the timed functions

The script produces the following image:

![image](/uploads/7482149b65fe8eff1c362f4521ee6987/image.png)

The simulation needs to be run with `-v 1` in order to obtain the timing. In order to facilitate future changes, I have moved the `labels` out of `analyze_runtime.py` and imported them in both the previous script and mine.

performance python scaling Matthieu Schaller Matthieu Schaller

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1243 Unskip reduce recursion 2021-01-21T16:23:19Z Loic Hausammann

Unskip reduce recursion

This branch optimizes the gravity unskip from this

![image](/uploads/9d92672248cb922ad27ad9781fb1729c/image.png)

to this

![image](/uploads/67dc18727be6d48cb867e807b6fae9af/image.png)

The idea is to flag the cells that have already been unskipped in order to stop the recursion sooner for the unskip of the other top level cells. The main improvement is for the pairs. The selfs are not as important but still give a slight speedup.

performance Peter W. Draper Peter W. Draper

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1225 Implement gear's stars skipping 2020-11-23T09:28:14Z Loic Hausammann
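The flagging idea of "Unskip reduce recursion" can be sketched as a recursion that returns early on cells it has already processed. The `Cell` class, the `unskipped` attribute name, and the visit counter are hypothetical and exist only to make the effect measurable in the example.

```python
class Cell:
    """Hypothetical cell with a flag recording that it was unskipped."""

    def __init__(self, progeny=None):
        self.progeny = progeny or []
        self.unskipped = False


def unskip(cell, counter):
    """Activate a cell recursively, stopping at already-unskipped sub-trees."""
    if cell.unskipped:
        return
    cell.unskipped = True
    counter[0] += 1  # number of cells actually processed
    for c in cell.progeny:
        unskip(c, counter)


def unskip_pairs(pairs):
    """Unskip both cells of every pair and return the number of visits."""
    counter = [0]
    for ci, cj in pairs:
        unskip(ci, counter)
        unskip(cj, counter)
    return counter[0]
```

Because a cell usually appears in many pair interactions, each sub-tree is walked once instead of once per pair, which matches the observation that pairs benefit most. The flags would need to be cleared again before the next unskip phase.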

Implement gear's stars skipping

As the stars are spending most of their time without any feedback, I am skipping them when no supernovae are produced.

@matthieu Do you accept the modifications done to `runner_time_integration.c` or should I use some `#ifdef` and keep only a single evolution function?

GEAR performance Loic Hausammann Loic Hausammann

https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1110 Push the cooling task to a lower level to gain more parallelism 2020-07-07T07:41:57Z Matthieu Schaller

Push the cooling task to a lower level to gain more parallelism

Push the cooling task to a lower level to gain more parallelism in the case of GRACKLE cooling for instance. This now contains just the changes to the cooling task.

enhancement GEAR performance SPH Matthieu Schaller Matthieu Schaller