Improvements to GIZMO implementation
Mixture of small improvements:
- Do not call external functions for min and max.
- Do not use double-precision constants.
- Reduce the number of divisions.
I believe this speeds up the Evrard collapse by 15-20%. @jborrow, could you confirm this on your accuracy vs. time plot?
@bvandenbroucke I am likely to have made a mistake somewhere. Could you cross-check that it is all fine?
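For reference, here is a minimal sketch of the kind of change involved, using a made-up kernel-evaluation snippet rather than the actual SWIFT functions:

```c
#include <math.h>

/* Before: external fmin() call, double-precision literal (which silently
 * promotes the whole expression to double), and two divisions. */
float weight_before(float r, float h) {
  const float q = fmin(r / h, 1.0);
  return (1.0 - q) / (h * h * h);
}

/* After: branchless local min, explicit float literals, and a single
 * division whose result is reused via multiplications. */
float weight_after(float r, float h) {
  const float h_inv = 1.f / h;
  const float q = r * h_inv;
  const float q_min = q < 1.f ? q : 1.f;
  return (1.f - q_min) * h_inv * h_inv * h_inv;
}
```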
@bvandenbroucke: I don't think it makes much of a difference here as we are likely memory-bandwidth limited at this stage. However, 1/sqrt is faster to compute than sqrt. Intel quotes 7 cycles for 1/sqrt and 18-21 cycles for sqrt on AVX, which is where I have an easy source of numbers. The multiplication adds one cycle to the total. The inverse itself is also quoted as 7 cycles on AVX.
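For illustration, the pattern being discussed looks roughly like this (a sketch, not the actual interaction code; whether `1.f / sqrtf()` is actually turned into the hardware reciprocal-sqrt instruction depends on the compiler and flags):

```c
#include <math.h>

/* Naive version: one sqrt followed by one division. */
static void r_and_rinv_naive(float r2, float *r, float *r_inv) {
  *r = sqrtf(r2);
  *r_inv = 1.f / *r;
}

/* Cheaper version: a vectorizing compiler with relaxed FP semantics can
 * map 1.f / sqrtf() onto the reciprocal-sqrt instruction, and the
 * division is replaced by a single extra multiplication to recover r. */
static void r_and_rinv_fast(float r2, float *r, float *r_inv) {
  const float ri = 1.f / sqrtf(r2);
  *r_inv = ri;
  *r = r2 * ri;
}
```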
@jborrow do you run with vectorization fully disabled?
We can also improve things by pushing the `extra_ghost` task further down the tree, as it is a bit of a bottleneck at the moment. Finally, we may also want to force the splitting of tasks a bit more in the GIZMO case, so that we get more, smaller tasks rather than the few large ones we have now. That will help with the load-balancing. If you are interested, here are the task plots:
added 12 commits
- 9818d481...a8f83cf0 - 11 commits from branch master
- 35609c96 - Merge branch 'master' into 'gizmo_speedup'
Here's the comparison. The black text shows the speedup at each point; note that for one of them we have a 'negative' 4% speedup.
Oh, I know -- this was requested earlier by @matthieu.
What is on the y axis exactly? And why did it change?
This is the "Chi Squared" of the density profile... Really it's the mean L2 norm contribution per particle.
It changes for a number of reasons. At low resolution, we only really see the outer part of the profile, which SWIFT models better; our lack of adaptive softening means the central regions end up with lower densities than the real solution.
At high resolution, we begin to 'tend to' the analytical solution more and more, hence the decrease at that end and, overall, the 'bump' in the middle at a 10k particle load.
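To make that concrete, my reading of the description above is a quantity of the form below (the exact normalisation is an assumption on my part):

```math
\chi^2 = \frac{1}{N} \sum_{i=1}^{N} \left[ \rho_i - \rho_{\rm analytic}(r_i) \right]^2
```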
Can you look at the output and check whether there is anything obviously weird?
We don't expect two runs to produce exactly the same result, as the order of the tasks is not fixed. That will cause stochastic changes between different runs. But I would not expect these changes to be that significant.
But then again, I don't know. It would be good if you could make a similar plot for different runs with the same code version, just to quantify the effect of stochastic changes.
They look a little different here, with the changed version appearing to have a tighter distribution. The colour encodes log(n_particles in bin). (Actual diff plot incoming.)
Here's the diff between those two (with square pixels, unfortunately...). This is the difference in the number of particles in each pixel (100k particle load).
It's the difference between the 2D histograms of particle positions in the log(rho)-log(r) plane, so it shows that one of the distributions (master) is 'wider' in rho and r than the other (gizmo_speedup).
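For anyone reproducing this, the construction is roughly the following (a sketch only; the actual plot was presumably made with a plotting script, and the bin count and axis ranges here are invented): bin each run's particles in the (log10 r, log10 rho) plane, then subtract the two count arrays pixel by pixel.

```c
#include <math.h>
#include <stddef.h>
#include <string.h>

#define NBINS 64

/* Bin particles in the (log10 r, log10 rho) plane. Calling this once per
 * run and subtracting the two arrays gives the per-pixel particle-count
 * difference shown in the diff plot. Bounds and NBINS are placeholders. */
void bin_log_rho_r(const float *r, const float *rho, size_t n,
                   float lr_min, float lr_max,
                   float lrho_min, float lrho_max,
                   int hist[NBINS][NBINS]) {
  memset(hist, 0, NBINS * NBINS * sizeof(int));
  for (size_t i = 0; i < n; ++i) {
    const float lr = log10f(r[i]);
    const float lrho = log10f(rho[i]);
    const int ix = (int)((lr - lr_min) / (lr_max - lr_min) * NBINS);
    const int iy = (int)((lrho - lrho_min) / (lrho_max - lrho_min) * NBINS);
    if (ix < 0 || ix >= NBINS || iy < 0 || iy >= NBINS) continue;
    hist[iy][ix]++; /* rows = density bins, columns = radius bins */
  }
}
```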
This is small enough that you can run it on the login nodes.
@jborrow: Are you sure you ran the old one with `--disable-vec`? I see important differences in the speed of the gravity code when vectorization is on or off.

@nnrw56 I'll upload some proper profiles later. The main message is that it is rather flat: essentially all interaction functions take slightly more time than with SPH. The other aspect is that the tasks are more expensive, so we should use a more aggressive task-splitting criterion here. But for that we should benchmark on actual test cases, not on these, which are mostly tests of accuracy.
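To illustrate what a more aggressive task-splitting criterion could look like (purely hypothetical; the function name, signature, and thresholds below are made up and do not match the actual engine code):

```c
/* Hypothetical sketch: split cells into sub-tasks at a lower particle
 * count when running the GIZMO scheme, since its per-particle
 * interactions are more expensive than the SPH ones. Threshold values
 * are placeholders, not tuned numbers. */
static int cell_needs_splitting(int particle_count, int using_gizmo) {
  const int threshold = using_gizmo ? 200 : 400;
  return particle_count > threshold;
}
```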
Running with the Intel compiler, I find that the time to solution for Gizmo on 16 cores is 566s and I get 198s for Gadget2, so a factor of 2.8x. A lot of it comes from the fact that Gizmo runs with shorter time-steps.
Whilst the Gadget2 run is dominated by gravity, the Gizmo one is dominated by hydro.
Both schemes could benefit from a better choice of task-splitting threshold.
mentioned in commit d6c10934