SWIFTsim merge requests
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests
Updated: 2023-11-30T15:24:19Z

!1823 Support optimization for the AMD aocc compiler
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1823
Updated: 2023-11-30T15:24:19Z
Author: Peter W. Draper

Make sure we use the optimized maths library and interprocedural optimization. Both are needed to get
optimization that works as well as other clang-based compilers.
Note in this MR we enable interprocedural optimization by default, unless the `--enable-debug`
or `--disable-optimization`
flags are used.
Assignee: Peter W. Draper

!1804 Fix AX_OPENMP performance regression
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1804
Updated: 2023-11-20T10:09:41Z
Author: Peter W. Draper

When moving from AC_OPENMP to AX_OPENMP we stopped the AC_SUBST of OPENMP_CFLAGS, so these flags have not been used since. Fix that, but only use OPENMP_CFLAGS where needed, that is as compiler hints for vectorizing some loops in the gravity interactions. The OpenMP runtime is only required when linking against an OpenMP FFTW library.
Fixes #865
Assignee: Peter W. Draper

!1737 Add cpuid for AMD Milan and Genoa CPUs.
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1737
Updated: 2023-07-06T17:11:31Z
Author: Peter W. Draper

Add cpuid patterns for AMD Milan and Genoa chips. Without these, the znver flags will not be set correctly.
Note that GCC 13 is needed for Genoa znver4 support.
Assignee: Peter W. Draper

!1686 Keep affinity of threadpool threads the same as the main thread on entry
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1686
Updated: 2023-01-30T13:50:10Z
Author: Peter W. Draper

Keep affinity of threadpool threads the same as the main thread on entry, regardless of later changes to the main thread.
This makes sure that we don't pin the threadpool threads to one CPU when the `engine_pin()` function is in effect.
Fixes #846
Assignee: Peter W. Draper

!1649 Implement lock free subcell splitting and other speed ups.
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1649
Updated: 2022-12-22T16:04:59Z
Author: Peter W. Draper

Based on work in the zoom-master branch by Matthieu and Will.
Avoids locking the memory used for subcells during splitting by having a pool of memory for each threadpool thread.
In simple tests this speeds things up nicely, especially during step 0.
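A minimal sketch of the per-thread pool idea, written in Python for brevity rather than in SWIFT's C. The class name `PerThreadCellPool`, the `chunk` size and the use of dictionaries as stand-ins for cells are all hypothetical; the sketch only illustrates that each thread takes sub-cells from its own free list, so the common path needs no lock.

```python
import threading


class PerThreadCellPool:
    """Hand out pre-allocated 'cells' from a free list private to each thread."""

    def __init__(self, chunk=64):
        self._local = threading.local()  # one independent namespace per thread
        self._chunk = chunk

    def get(self):
        free = getattr(self._local, "free", None)
        if not free:
            # First call on this thread, or its list ran dry: grab a new chunk.
            free = [dict() for _ in range(self._chunk)]
            self._local.free = free
        return free.pop()


def split_cells(pool, n, out):
    """Pretend to split a cell n times, taking each sub-cell from the pool."""
    out.extend(pool.get() for _ in range(n))
```

Because no free list is ever shared between threads, two threads splitting cells at the same time never contend for the same memory.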
Also speeds up the engine by assigning the runner that owns a cell more uniformly and randomly,
and by using more information about the weights when scheduling tasks.
An EAGLE_50 volume run on a single COSMA 8 node shows speed-ups of the order of 20% over
the initial 128 steps. This is also faster than 8xMPI on a node, but the reasons for that are
more nuanced and may be less for a proper MPI run (since for instance the MPI limits for
fastest possible step and communications within a step will return).
Assignee: Matthieu Schaller

!1509 Parallel mesh assignment also in single-node case
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1509
Updated: 2022-04-22T12:59:47Z
Author: Matthieu Schaller

Change the strategy used when interpolating the `gpart` onto the gravity mesh. Each thread (aka top-level cell) constructs a local patch of the mesh and assigns its particles to it. When done, the patch is written to the global mesh using atomics. This is now similar to what is done in the distributed MPI case.
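The local-patch strategy can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the SWIFT implementation: the function name `assign_with_local_patch` is hypothetical, the mesh is 1D and periodic with unit cell size, the deposit is a simple cloud-in-cell split between two neighbouring mesh points, and a lock stands in for the atomic adds because Python has none.

```python
import math
import threading


def assign_with_local_patch(positions, masses, global_mesh, lock):
    """Deposit one top-level cell's particles via a private patch (1D CIC)."""
    if not positions:
        return
    n = len(global_mesh)
    lo = min(math.floor(x) for x in positions)
    hi = max(math.floor(x) for x in positions) + 1  # CIC touches i and i + 1
    patch = [0.0] * (hi - lo + 1)

    # Accumulate into the private patch: no synchronisation needed here.
    for x, m in zip(positions, masses):
        i = math.floor(x)
        frac = x - i
        patch[i - lo] += m * (1.0 - frac)
        patch[i - lo + 1] += m * frac

    # One merge per patch entry instead of one shared update per particle.
    # The lock is a stand-in for atomic additions (assumption).
    with lock:
        for k, value in enumerate(patch):
            global_mesh[(lo + k) % n] += value
```

When many particles share a few mesh cells, the shared mesh is touched once per patch entry rather than once per particle, which is where the benefit for zoom-like particle distributions would come from.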
This strategy seems to be beneficial when there are lots of particles in only a few cells, for instance in a zoom run. It seems to not be slower in other cases either.
The behaviour can be controlled by a runtime parameter (`Gravity:mesh_uses_local_patches`, which defaults to `1`) if one wants to roll back to the "old" per-particle atomic assignment.
Assignee: Bert Vandenbroucke

!1458 Refactoring of the time-step communication tasks
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1458
Updated: 2021-12-16T13:44:02Z
Author: Matthieu Schaller

Significant re-factoring of the way the time-step sizes are being exchanged.
Same as !1455 but without the last batch of changes.
Summary:
* A new top-level task collects the time-step sizes from the super level to the top-level. This was formerly done by making `engine_collect_end_of_step()` recurse.
* The `timestep`, `timestep_limiter`, and `timestep_sync` tasks all unlock that top-level task.
* `engine_collect_end_of_step()` now only loops (via threadpool) over the local top-level cells. No recursion any more.
* For each pair of top-level cells in the proxies we construct a pair of send/recv comm tasks.
* That comm task packs up the time-step sizes (dt) of the whole hierarchy, sends them, and unpacks them on the receiving side.
* The top-level time-step collection task unlocks the send.
* The individual per-species `tend` communication tasks that used to live at the super level are removed.
* The second call to `engine_launch()` done every step to deal with the timestep limiter effect is removed (as it is now properly dealt with by the top-level task dependency)
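The division of labour in the list above can be sketched as follows: a per-top-level-cell collection task folds the time-step information of its hierarchy into the top-level cell, after which the end-of-step collection is a flat loop. The `Cell` class, the field `ti_end_min` as used here, and both function names are hypothetical stand-ins for this sketch, not SWIFT's actual structures.

```python
from dataclasses import dataclass, field

NO_STEP = 2**62  # sentinel meaning "no particle to update" (assumption)


@dataclass
class Cell:
    ti_end_min: int = NO_STEP
    progeny: list = field(default_factory=list)


def collect_task(top):
    """Top-level collection task: fold the hierarchy's minima into `top`."""
    def visit(cell):
        for child in cell.progeny:
            visit(child)
            cell.ti_end_min = min(cell.ti_end_min, child.ti_end_min)
    visit(top)


def collect_end_of_step_flat(top_cells):
    """Flat loop over the local top-level cells only; no recursion."""
    return min(cell.ti_end_min for cell in top_cells)
```

In this picture the recursive work happens inside the task graph, once per top-level cell, so the serial collection at the end of the step only has to read one value per top-level cell.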
This should help speed up the smallest steps by reducing the level of the plateau we usually see in the "main sequence" plots.
Assignee: Peter W. Draper

!1360 Add NUMA interleave of memory allocations
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1360
Updated: 2021-11-04T18:16:46Z
Author: Peter W. Draper

Adds the option to interleave memory allocations uniformly across the NUMA regions
which are allowed by the CPU affinity mask.
Seems to help the threadpool when running EAGLE_50 on a single node of COSMA8,
see #760.
Assignee: Bert Vandenbroucke

!1419 In the EAGLE model, no feedback from 0-age stars
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1419
Updated: 2021-10-18T09:18:04Z
Author: Matthieu Schaller

Backport function from the 'stu' branch of the FLAMINGO-fork.
- Exploit the idea that 0-age stars don't actually do any feedback so we can skip the activation of star-feedback tasks if there are no stars.
- Simplify the feedback-task activation mechanism by unifying things into a single call.
- Add the scheme-level option to not run tasks if there are no stars at all.
Assignee: Matthieu Schaller

!1341 Use only min mass gas in time step
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1341
Updated: 2021-10-15T10:58:12Z
Author: Loic Hausammann

Fix #758
My student finished their simulations and everything seems alright. Do you wish to keep the current behavior for your simulations, or are you fine if I merge it like this?
Assignee: Matthieu Schaller

!1408 MPI parallel mesh gravity - hashmap free
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1408
Updated: 2021-08-19T19:56:04Z
Author: Matthieu Schaller

Implements the mesh gravity calculation in an MPI-distributed fashion. This allows running with much larger meshes and reduces the memory footprint.
Base implementation is similar to !1045.
* Each rank accumulates its own contributions to the density field. These would need to be stored in some kind of sparse array to avoid making any assumptions about how the domain decomposition is done. Unlike !1045, we do not use a hashmap but rather a simple array since we can actually predict what we need and construct nice keys.
* Each rank allocates a slab of the full mesh as required by FFTW
* The mesh contributions are sent to whichever rank stores the corresponding slab. This is reasonably straightforward because the coordinates of a cell indicate which rank it's stored on.
* The FFT/Green function/CIC deconvolution/inverse FFT are carried out to make a slab-distributed mesh containing the potential
* Each rank calculates which cells it needs to calculate the potential gradient for its particles and requests them from whichever node they're on. This is slightly trickier than constructing the mesh because we need several cells around each particle to evaluate the gradient.
* We can then evaluate the potential and acceleration on the particles as usual.
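The routing step in the list above relies on a mesh cell's coordinate determining the rank that stores it. A minimal sketch, assuming the mesh is cut into equal-sized slabs along the first axis (in a real FFTW MPI code the slab extents would come from `fftw_mpi_local_size_3d`); the function names `slab_owner` and `route_contributions` are hypothetical:

```python
def slab_owner(i, mesh_size, n_ranks):
    """Rank holding mesh plane i, assuming equal blocks of ceil(N / n_ranks)."""
    block = -(-mesh_size // n_ranks)  # ceiling division
    return i // block


def route_contributions(contribs, mesh_size, n_ranks):
    """Group {(i, j, k): mass} contributions by the rank owning plane i."""
    outbox = [dict() for _ in range(n_ranks)]
    for (i, j, k), mass in contribs.items():
        i_wrapped = i % mesh_size  # periodic wrap along the slab axis
        rank = slab_owner(i_wrapped, mesh_size, n_ranks)
        key = (i_wrapped, j, k)
        outbox[rank][key] = outbox[rank].get(key, 0.0) + mass
    return outbox
```

Fetching the potential back is the same lookup in reverse, except that each particle needs several neighbouring cells, so the set of requested keys is larger than the set of cells the particles sit in.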
Another difference with !1045 is that I am using a hand-written bucket sort instead of the `qsort()` that was used originally.
Since we only need a crude sort with a fixed small number of bins this is much much faster. (But still the current bottleneck)
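A bucket sort with a small, fixed number of bins can be sketched as below. This is a generic counting-style bucket sort keyed on the destination rank, not the code from the merge request; the function name and argument layout are hypothetical. The per-bin counts it produces are the kind of quantity that can double as send counts.

```python
def bucket_sort_by_rank(items, dest_rank, n_ranks):
    """Group items by destination rank in two linear passes.

    Returns the reordered items, the per-rank counts and the start offsets.
    """
    # Pass 1: count how many items go to each rank.
    counts = [0] * n_ranks
    for r in dest_rank:
        counts[r] += 1

    # Exclusive prefix sum gives the first slot of every bucket.
    offsets = [0] * n_ranks
    for r in range(1, n_ranks):
        offsets[r] = offsets[r - 1] + counts[r - 1]

    # Pass 2: drop every item into the next free slot of its bucket.
    cursor = list(offsets)
    out = [None] * len(items)
    for item, r in zip(items, dest_rank):
        out[cursor[r]] = item
        cursor[r] += 1
    return out, counts, offsets
```

Both passes are linear in the number of items, whereas a comparison sort such as `qsort()` does O(n log n) work to establish an ordering within each bin that is not needed here.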
The code needs to be configured with `--enable-mpi-mesh-gravity` and the runtime parameter `Gravity:distributed_mesh` must be set to 1.
Implements #524. Bypasses #716. Likely supersedes !1045.
Possible improvements:
- [x] Use the threadpool to speed-up the three bucket sorts. At least to construct the counts.
- [x] Construct the array of send/recv counts at the same time as the bucket sort. Or just from the bucket counts?
- [x] Check whether the potential patches can be smaller than currently and only cover the particle extent.
Assignee: Peter W. Draper

!1385 Pack the timebin for the limiter communications
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1385
Updated: 2021-06-24T12:24:32Z
Author: Matthieu Schaller

For the `part` communication related to the time-step limiter, only send the `time-bin` (an `int8_t`) rather than the whole particle. The packing is done manually rather than by using an MPI-provided mechanism.
Assignee: Peter W. Draper

!1356 Add profiling for CSDS
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1356
Updated: 2021-05-12T08:22:10Z
Author: Loic Hausammann

Nothing fancy here, I am just adding a task category for the CSDS and adding some messages in order to track the time spent in the CSDS with `analyze_runtime.py`.
Assignee: Matthieu Schaller

!1352 CSDS Move the index files into the reader.
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1352
Updated: 2021-05-11T13:52:50Z
Author: Loic Hausammann

Keeping the information required for the index files in memory during the simulation costs far too much in terms of both memory and performance. To reduce this, I have moved all the logic into the reader, where we will not have as much information to track.
Some other changes worth mentioning:
- In the CSDS yaml file, I am writing the initial number of particles. This value is used to initialize the arrays when generating the index files. The arrays can still grow if needed.
- When all the particles are written just before the first step, they are now flagged as being created.
- The data within the special flag now contains the particle type. The type used to be found from the index file, so we now need a way to get it directly from the logfile. We still use 4 bytes but now we have 1 byte for the particle type, 2 bytes for any information (e.g. MPI rank for particles leaving/entering a rank) and 1 byte for the type of event (e.g. particle leaving/entering a rank, star formation, deletion, creation, ...).
- Initialize the time step counter to 0. There is no need to write the particles quickly at the start of the simulation as we are already manually writing them.
Milestone: Continuous Simulation Data Stream
Assignee: Matthieu Schaller

!1317 Do not collect and store ti_end_max since we never make use of it for anything.
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1317
Updated: 2021-03-28T21:55:46Z
Author: Matthieu Schaller

We collect and store ti_end_max for each particle type but never make use of it. We used to, but that is no longer the case as there is no real gain to be had anywhere.
Do you agree that it's sensible to remove it and hence shave some memory off the cells and the pcell communications?
Assignee: Peter W. Draper

!1302 Logger multithreading
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1302
Updated: 2021-03-15T08:51:30Z
Author: Loic Hausammann

Still waiting on !1294 in order to add the number of threads in the parameter structure.
Nothing fancy here, I am just rewriting the `read_all` functions in order to use a threadpool. While the code is a bit slower now on 1 thread, I see a speedup of 2.3 when using 4 threads compared to the previous version. The speedup was measured with `SedovBlast3D`, so we are still dealing with a relatively small example where setting up the threadpool takes a non-negligible amount of time.
Milestone: Continuous Simulation Data Stream
Assignee: Pedro Gonnet

!1278 Add script showing the scaling of all the timed functions
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1278
Updated: 2021-01-30T11:09:12Z
Author: Loic Hausammann

The script produces the following image:
![image](/uploads/7482149b65fe8eff1c362f4521ee6987/image.png)
The simulation needs to be run with `-v 1` in order to obtain the timing.
In order to facilitate future changes, I have moved the `labels` out of `analyze_runtime.py` and import them in both the previous script and mine.
Assignee: Matthieu Schaller

!1243 Unskip reduce recursion
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1243
Updated: 2021-01-21T16:23:19Z
Author: Loic Hausammann

This branch optimizes the gravity unskip from this
![image](/uploads/9d92672248cb922ad27ad9781fb1729c/image.png)
to this
![image](/uploads/67dc18727be6d48cb867e807b6fae9af/image.png)
The idea is to flag the cells that have already been unskipped in order to stop the recursion sooner for the unskip of the other top level cells.
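The flagging idea can be sketched as an early exit in a recursive walk. The `Cell` class, its `unskipped` flag and the visit counter are hypothetical and serve only to show how a second walk that reaches an already-handled cell stops immediately; the real unskip also has to reset such a flag between steps, which is omitted here.

```python
class Cell:
    def __init__(self, progeny=()):
        self.progeny = list(progeny)
        self.unskipped = False  # set once this sub-tree has been handled


def unskip(cell, visits):
    """Walk the hierarchy, stopping at cells that were already unskipped."""
    if cell.unskipped:
        return  # someone else already recursed through this sub-tree
    cell.unskipped = True
    visits[0] += 1
    for child in cell.progeny:
        unskip(child, visits)
```

With pair interactions the same cell is reached from several top-level neighbours, so without the flag its whole sub-tree would be walked once per neighbour.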
The main improvement is for the pairs. The selfs are not as important but still give a slight speedup.
Assignee: Peter W. Draper

!1225 Implement gear's stars skipping
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1225
Updated: 2020-11-23T09:28:14Z
Author: Loic Hausammann

As the stars spend most of their time without any feedback, I am skipping them when no supernovae are produced.
@matthieu Do you accept the modifications done to `runner_time_integration.c` or should I use some `#ifdef` and keep only a single evolution function?
Assignee: Loic Hausammann

!1110 Push the cooling task to a lower level to gain more parallelism
https://gitlab.cosma.dur.ac.uk/swift/swiftsim/-/merge_requests/1110
Updated: 2020-07-07T07:41:57Z
Author: Matthieu Schaller

Push the cooling task to a lower level to gain more parallelism in the case of GRACKLE cooling for instance.
This now contains just the changes to the cooling task.
Assignee: Matthieu Schaller