
Improvements to GIZMO implementation

Merged Matthieu Schaller requested to merge gizmo_speedup into master

Mixture of small improvements:

  • Do not call external functions for min and max.
  • Do not use double-precision constants.
  • Reduce the number of divisions.

I believe this speeds up the Evrard collapse by 15-20%. @jborrow could you confirm this on your accuracy vs. time plot?

@bvandenbroucke I am likely to have made a mistake somewhere. Could you cross-check that it is all fine?
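
To illustrate the kind of change involved, here is a minimal sketch (the function and variable names are hypothetical, not the actual SWIFT kernels):

    #include <math.h>

    /* Local branchless min/max instead of calling an external function. */
    #define min(a, b) ((a) < (b) ? (a) : (b))
    #define max(a, b) ((a) > (b) ? (a) : (b))

    /* Hypothetical kernel-like function, for illustration only. */
    float kernel_weight(const float r2, const float h) {

      /* Single-precision constants (0.5f, not 0.5) avoid silently promoting
         the whole expression to double precision. */
      const float half = 0.5f;

      /* Compute 1/h and 1/r once and reuse them rather than dividing
         repeatedly. */
      const float h_inv = 1.f / h;
      const float r_inv = 1.f / sqrtf(r2);
      const float r = r2 * r_inv;
      const float u = r * h_inv;

      return max(0.f, half * (1.f - min(u, 1.f)));
    }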

Activity

  • What settings are you using for your Evrard? The default ones in the parameter file? In my tests, it seemed that only 1/10th of the softening was required. I will look into this more on Monday, as well as produce an accuracy(t) plot comparing this version with the master branch.

  • I was running the default setup. If you think the default value of the softening is not good, we should change it.

  • I'll run another test to make sure tomorrow.

  • Also, what particle load did you use?

  • 100 to 100k particles, for a convergence test.

  • You run the Evrard collapse with as little as 100 particles?

  • Yeah not for any particular reason -- just because it's part of my for loop...

  • Surely that cannot even remotely look like the real answer...

  • I cannot see any obvious problems with the changes you made.

    I do have a question: why is r_inv = 1.f/sqrtf(r2) and then r = r2 * r_inv faster than r = sqrtf(r2) and r_inv = 1./r?

  • No, you're right -- 100 particles is utter nonsense. But with 1000, we're getting there:

    [image: Screen_Shot_2018-04-09_at_09.28.20]

  • @bvandenbroucke: I don't think it makes much of a difference here as we are likely memory-bandwidth limited at this stage. However, 1/sqrt is faster to compute than sqrt: Intel quotes 7 cycles for 1/sqrt and 18-21 cycles for sqrt on AVX (the architecture for which I have an easy source of numbers). The extra multiplication adds one cycle to the total. The inverse itself is also quoted as 7 cycles on AVX.

  • Note that the compiler is in general not allowed to replace sqrt(x2) by x2 * (1/sqrt(x2)), since that may lead to differences in the last bit. Some compilers do it at sufficiently high optimisation levels. However, there is no good reason for us not to do it explicitly, even in non-optimised code.

  • Interesting. Does that work for double precision as well? Because I have quite a lot of similar code that could maybe benefit from this optimization.

  • Yes, it is also true for double precision, with everything typically being a factor of 2x slower.
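
    To make the two variants concrete, here is a minimal sketch (hypothetical helper names, not the actual SWIFT code; the cycle figures in the comments are the AVX numbers quoted above):

        #include <math.h>

        /* The form asked about: a square root followed by a division. */
        void get_r_and_r_inv_v1(const float r2, float *r, float *r_inv) {
          *r = sqrtf(r2);    /* sqrt: 18-21 cycles on AVX per the figures above */
          *r_inv = 1.f / *r; /* plus a division (quoted as ~7 cycles above) */
        }

        /* The form used in this branch: a reciprocal square root and a multiply. */
        void get_r_and_r_inv_v2(const float r2, float *r, float *r_inv) {
          *r_inv = 1.f / sqrtf(r2); /* a compiler that is allowed to use the
                                       approximate rsqrt instruction (see the
                                       discussion above) can map this to the
                                       ~7-cycle 1/sqrt */
          *r = r2 * (*r_inv);       /* the extra multiply adds about one cycle */
        }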

  • @jborrow do you run with vectorization fully disabled?

  • We can also improve things by pushing the extra_ghost task further down the tree, as it is a bit of a bottleneck at the moment. Finally, we may also want to force the splitting of tasks a bit more in the GIZMO case, so that we have more, smaller tasks rather than the few large ones we have now. That will help with the load-balancing.

  • One other thing worth considering is that the Gadget-2 scheme takes 4x fewer steps to reach the end time than Gizmo here.

  • If you are interested, here are the task plots:

  • These checks are in the queue, should be ready for analysis tomorrow morning.

  • Matthieu Schaller added 12 commits

  • Any objections to merging these changes?

  • I have the runs completed; give me 30 minutes and I will get a confirmation of the speedup.

  • Joining this party a bit late, but has anybody ever profiled the GIZMO parts of the code? I'd be curious to see where it's taking the most cycles and take a closer look, if anybody is interested.

  • Here's the comparison. The black text shows the speedup at that point; note that for one of them we have a 'negative' 4% speedup.

    [image: comparison]

    Edited by Josh Borrow
  • That's a benchmark, not a profile :) What I was looking for is a breakdown of which function uses how much CPU.

  • Oh, I know -- this was requested earlier by @matthieu

  • But I suppose this motivates running a full profile :p

  • What is on the y axis exactly? And why did it change?

    Edited by Bert Vandenbroucke
  • This is the "Chi Squared" of the density profile... Really it's the mean L2 norm contribution per particle.

    It changes for a number of reasons. At low resolution, we only really see the outer part of the profile, which is better modelled by SWIFT; our lack of adaptive softening means the central regions end up with lower densities than the real solution.

    At high resolution, we begin to 'tend to' the analytical solution more and more, hence the decrease and, overall, the 'bump' in the middle at a 10k particle load.

    Edited by Josh Borrow
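
    As a rough sketch of what such a metric could look like (an assumption; the actual analysis script is not part of this thread):

        #include <stddef.h>

        /* Mean squared deviation of the particle densities from a reference
           profile, i.e. (1/N) * sum_i (rho_i - rho_ref_i)^2. The reference
           values stand in for whatever analytic solution is used. */
        double mean_l2_density_error(const double *rho, const double *rho_ref,
                                     const size_t N) {
          double sum = 0.;
          for (size_t i = 0; i < N; ++i) {
            const double d = rho[i] - rho_ref[i];
            sum += d * d;
          }
          return sum / (double)N;
        }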
  • Right, then why did it change? These changes were not supposed to change the result at all. Is this just stochastic noise, or is something broken?

  • Oh you mean between master and the changes? I don't actually know. I will submit another job and get it to re-run.

  • Can you look at the output and check if there are any obvious weird things?

    We don't expect two runs to produce exactly the same result, as the order of the tasks is not fixed. That will cause stochastic changes between different runs. But I would not expect these changes to be that significant.

    But then again, I don't know. It would be good if you could make a similar plot for different runs with the same code version, just to quantify the effect of stochastic changes.

  • You could also just run with a single thread, in which case the execution order is identical.

  • They look a little different here, with the changed version appearing to have a tighter distribution. The colour codes for log(n_particles in bin). (Actual diff plot incoming.)

    [image: actual_difference]

    Edited by Josh Borrow
  • Here's the diff between those two (with square pixels, unfortunately...). This is the difference in the number of particles in each pixel (100k particle load).

    [image: diff_plot]

    Edited by Josh Borrow
  • The difference is even more pronounced at a 10k particle load; see the following plots:

    [images: actual_difference, diff_plot]

  • Not sure I understand your diff plots. I interpret those as the master being below the changes, but I don't see any hint of that in the actual density profiles.

  • It's the difference between the 2D histograms of particle positions in the log(rho)-log(r) plane, so it shows that one of the distributions (master) is 'wider' in rho, r than the other (gizmo_speedup).

    Edited by Josh Borrow
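
    A rough sketch of how such a diff could be built (an assumption about the plotting script, which is not shown here; the bin counts and axis ranges are placeholders):

        #include <math.h>
        #include <stddef.h>

        #define N_R_BINS 64
        #define N_RHO_BINS 64

        /* Bin particles into a 2D histogram over the log10(r)-log10(rho)
           plane. The diff plot is then hist_master[i][j] - hist_branch[i][j]
           for every bin. */
        void fill_log_histogram(const double *r, const double *rho,
                                const size_t N,
                                int hist[N_R_BINS][N_RHO_BINS]) {
          const double log_r_min = -2., log_r_max = 1.;
          const double log_rho_min = -3., log_rho_max = 3.;

          for (size_t i = 0; i < N; ++i) {
            const int ir = (int)((log10(r[i]) - log_r_min) /
                                 (log_r_max - log_r_min) * N_R_BINS);
            const int irho = (int)((log10(rho[i]) - log_rho_min) /
                                   (log_rho_max - log_rho_min) * N_RHO_BINS);
            if (ir >= 0 && ir < N_R_BINS && irho >= 0 && irho < N_RHO_BINS)
              hist[ir][irho]++;
          }
        }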
  • Then again, this could just be an artifact of the rejection sampling initial conditions. I am, I suppose, generating different initial conditions for each run, even if they have the same particle number, which is slightly dodgy.

  • Can you run the exact same thing then?

  • Yes, it's in the queue now.

  • This is small enough that you can run on the login nodes.

    @jborrow : Are you sure you ran the old one with --disable-vec? I see important differences in the speed of the gravity code when vectorization is on or off.

    @nnrw56 I'll upload some proper profiles later. The main message is that it is rather flat: essentially all interaction functions take slightly more time than with SPH. The other aspect is that the tasks are more expensive, such that we should use a more aggressive task-splitting criterion here. But for that we should benchmark on actual test cases and not on these, which are mostly tests of accuracy.

    Edited by Matthieu Schaller
  • I ran neither with --disable-vec, no.

  • What compiler are you using?

  • Running with the Intel compiler, I find that the time to solution for Gizmo on 16 cores is 566s, while I get 198s for Gadget2 -- so a factor of 2.8x. A lot of it comes from the fact that Gizmo runs with shorter time-steps.

    Whilst the Gadget2 run is dominated by gravity, the Gizmo one is dominated by hydro.

    Both schemes could benefit from a better choice of task-splitting threshold.

    Here is the VTune output for that run with Gizmo:

    [image: Screenshot_20180410_172749]

  • added 2 commits

    • b403c0fa - Removed unused isothermal Riemann solver. Added generic input and output debug…
    • d847056c - Added debug check on particle drift distance.


  • So it does in fact look like the differences in accuracy were spurious and caused by the rejection sampling of the initial conditions (see below).

    However, with GCC 4.8.1 I only see a ~5% improvement over the master branch.

    [image: comparison]

    [images: actual_difference, diff_plot]

  • Thanks! That's a relief... It is also a sign that the same ICs should be used for comparisons, as noise is still important here.

    Also, GCC 4.8 is really bad at vectorizing stuff and optimizing code. I'd bet that ICC would really show an improvement here.

  • mentioned in commit d6c10934
