Mesh gravity speed-ups
Implements two improvements:
- Use the threadpool to apply the Green function in the PM part of the code
- Use an asynchronous all-reduce to communicate the mesh across the MPI ranks.
To implement the second part, I have removed the call to `space_split()` that was in `space_rebuild()`. `space_split()` is now called after the communication has been initiated.
For reference, here are the results that prompted these changes:
So here is a more interesting, badly behaved run. It is EAGLE_50/127 with a mesh size of 1200, run on 32 ranks across 16 nodes:
Time spent in the different code sections:

- 'Engine Launch' (53 calls, time: 1156.8316s): 30.7219%
- 'Mesh Comunication' (53 calls, time: 768.0140s): 20.3961%
- 'Green Function' (53 calls, time: 287.8016s): 7.6431%
- 'Forward Fourier Transform' (53 calls, time: 200.7225s): 5.3306%
- 'Backwards Fourier Transform' (53 calls, time: 196.3886s): 5.2155%
- 'Space Rebuild' (53 calls, time: 160.4167s): 4.2602%
- 'Exchanging Cell Tags' (53 calls, time: 140.5005s): 3.7313%
- 'Engine Repartition' (21 calls, time: 137.7499s): 3.6582%
- 'Gpart Assignment' (53 calls, time: 109.1116s): 2.8977%
- 'Engine Marktasks' (53 calls, time: 103.5276s): 2.7494%
- 'Engine Exchange Cells' (53 calls, time: 79.4477s): 2.1099%
- 'Updating Particle Counts' (53 calls, time: 57.4324s): 1.5252%
- 'Reading Initial Conditions' (1 call, time: 56.5060s): 1.5006%
- 'Engine Recompute Displacement Constraint' (53 calls, time: 50.7576s): 1.3480%
- 'Engine Collect End Of Step' (52 calls, time: 32.5977s): 0.8657%
51 steps, with a repartition or rebuild each step... BTW, the options used are: `--cooling --star-formation --feedback --stars --cosmology --hydro --self-gravity`
added 74 commits

- d77797f3...33813e81 - 73 commits from branch `master`
- bc2f1761 - Merge remote-tracking branch 'origin/master' into parallel_mesh
That seems more promising. The elapsed time is 10% faster (1556s to 1388s). Here is a comparison of the analysis from `master` and this branch. The Green function is clearly faster, but whether the iallreduce is, is less clear. I suspect we have just shifted that cost elsewhere and the full price is being paid at the next MPI calls, hence the extra time in updating particle counts and exchanging multipoles.

| master_function | master_time | branch_time | time_diff | branch_function | master_perc | branch_perc | master_ncalls | branch_ncalls |
|---|---:|---:|---:|---|---:|---:|---:|---:|
| Engine Launch Task | 479.0461 | 478.8341 | 0.2120 | Engine Launch Task | 30.785 | 34.4783 | 53 | 53 |
| Mesh Comunication | 408.56 | | | | 26.2554 | | 28 | |
| Green Function | 154.9366 | 19.2588 | 135.6778 | Green Function | 9.9567 | 1.3867 | 28 | 28 |
| Forward Fourier Transform | 108.4265 | 108.6355 | -0.2090 | Forward Fourier Transform | 6.9678 | 7.8223 | 28 | 28 |
| Backwards Fourier Transform | 105.781 | 106.2935 | -0.5125 | Backwards Fourier Transform | 6.7978 | 7.6536 | 28 | 28 |
| Gpart Assignment | 58.4658 | 58.8546 | -0.3888 | Gpart Assignment | 3.7572 | 4.2378 | 28 | 28 |
| Space Rebuild | 40.2648 | 13.4273 | 26.8375 | Space Rebuild | 2.5875 | 0.9668 | 28 | 28 |
| Engine Recompute Displacement Constraint | 25.3707 | 82.298 | -56.9273 | Engine Recompute Displacement Constraint | 1.6304 | 5.9258 | 28 | 28 |
| Updating Particle Counts | 15.1419 | 122.843 | -107.7011 | Updating Particle Counts | 0.9731 | 8.8453 | 28 | 28 |
| Engine Exchange Cells | 15.0863 | 20.9501 | -5.8638 | Engine Exchange Cells | 0.9695 | 1.5085 | 28 | 28 |
| Engine Unskip | 14.8753 | 15.0331 | -0.1578 | Engine Unskip | 0.9559 | 1.0825 | 25 | 25 |
| Dumping Restart Files | 9.8315 | 8.7476 | 1.0839 | Dumping Restart Files | 0.6318 | 0.6299 | 1 | 1 |
| Engine Drift All | 9.4158 | 9.2831 | 0.1327 | Engine Drift All | 0.6051 | 0.6684 | 28 | 28 |
| Reading Initial Conditions | 7.9241 | 8.3961 | -0.4720 | Reading Initial Conditions | 0.5092 | 0.6046 | 1 | 1 |
| Engine Split Gas Particles | 6.0761 | 6.0528 | 0.0233 | Engine Split Gas Particles | 0.3905 | 0.4358 | 25 | 25 |
| Exchanging Cell Tags | 5.0707 | 6.2089 | -1.1382 | Exchanging Cell Tags | 0.3259 | 0.4471 | 28 | 28 |
| Engine Split | 4.307 | 4.2913 | 0.0157 | Engine Split | 0.2768 | 0.309 | 1 | 1 |
| Space Init | 3.3497 | 3.2974 | 0.0523 | Space Init | 0.2153 | 0.2374 | 1 | 1 |
| Recursively Linking Foreign Arrays | 2.8971 | 2.8864 | 0.0107 | Recursively Linking Foreign Arrays | 0.1862 | 0.2078 | 28 | 28 |
| Engine Collect End Of Step | 2.0745 | 3.2632 | -1.1887 | Engine Collect End Of Step | 0.1333 | 0.235 | 52 | 52 |
| Engine Marktasks | 1.8917 | 1.7128 | 0.1789 | Engine Marktasks | 0.1216 | 0.1233 | 28 | 28 |
| Making Extra Hydroloop Tasks | 1.021 | 0.9933 | 0.0277 | Making Extra Hydroloop Tasks | 0.0656 | 0.0715 | 28 | 28 |
| Communicating Rebuild Flag | 0.6106 | 0.6082 | 0.0024 | Communicating Rebuild Flag | 0.0392 | 0.0438 | 51 | 51 |
| Engine Print Stats | 0.488 | 0.3971 | 0.0909 | Engine Print Stats | 0.0314 | 0.0286 | 2 | 2 |
| Engine Exchange Top Multipoles | 0.4707 | 205.3428 | -204.8721 | Engine Exchange Top Multipoles | 0.0302 | 14.7856 | 28 | 28 |
| Setting Super-Pointers | 0.4387 | 0.4055 | 0.0332 | Setting Super-Pointers | 0.0282 | 0.0292 | 28 | 28 |
| Scheduler Reweight | 0.4318 | 0.4474 | -0.0156 | Scheduler Reweight | 0.0277 | 0.0322 | 28 | 28 |
| Ranking The Tasks | 0.3568 | 0.3686 | -0.0118 | Ranking The Tasks | 0.0229 | 0.0265 | 28 | 28 |
| Creating Recv Tasks | 0.3473 | 0.3334 | 0.0139 | Creating Recv Tasks | 0.0223 | 0.024 | 28 | 28 |
| Counting And Linking Tasks | 0.2676 | 0.2684 | -0.0008 | Counting And Linking Tasks | 0.0172 | 0.0193 | 28 | 28 |
| Setting Unlocks | 0.2451 | 0.2459 | -0.0008 | Setting Unlocks | 0.0158 | 0.0177 | 28 | 28 |
| Making Gravity Tasks | 0.2363 | 0.2003 | 0.0360 | Making Gravity Tasks | 0.0152 | 0.0144 | 28 | 28 |
| Engine Init | 0.2269 | 0.2238 | 0.0031 | Engine Init | 0.0146 | 0.0161 | 1 | 1 |
| Linking Gravity Tasks | 0.2002 | 0.2062 | -0.0060 | Linking Gravity Tasks | 0.0129 | 0.0148 | 28 | 28 |
| Creating Send Tasks | 0.1443 | 0.1558 | -0.0115 | Creating Send Tasks | 0.0093 | 0.0112 | 28 | 28 |
| Making Hydro Tasks | 0.1226 | 0.1219 | 0.0007 | Making Hydro Tasks | 0.0079 | 0.0088 | 28 | 28 |
| Counting Number Of Foreign Particles | 0.1106 | 0.1086 | 0.0020 | Counting Number Of Foreign Particles | 0.0071 | 0.0078 | 28 | 28 |
| Space List Useful Top Level Cells | 0.0682 | 0.0676 | 0.0006 | Space List Useful Top Level Cells | 0.0044 | 0.0049 | 28 | 28 |
| Engine Print Task Counts | 0.0342 | 0.04 | -0.0058 | Engine Print Task Counts | 0.0022 | 0.0029 | 81 | 81 |
| Splitting Tasks | 0.0199 | 0.0198 | 0.0001 | Splitting Tasks | 0.0013 | 0.0014 | 28 | 28 |
| Engine Drift Top Multipoles | 0.0165 | 0.0149 | 0.0016 | Engine Drift Top Multipoles | 0.0011 | 0.0011 | 25 | 25 |
| Engine Repartition Trigger | 0.003 | 0.0032 | -0.0002 | Engine Repartition Trigger | 0.0002 | 0.0002 | 51 | 51 |
| Fof Search Tree | 0.0 | 0.0 | 0.0 | Fof Search Tree | 0.0 | 0.0 | 0 | 0 |
| Engine Activate Fof Tasks | 0.0 | 0.0 | 0.0 | Engine Activate Fof Tasks | 0.0 | 0.0 | 0 | 0 |
| Engine Make Fof Tasks | 0.0 | 0.0 | 0.0 | Engine Make Fof Tasks | 0.0 | 0.0 | 0 | 0 |
| Fof Allocate | 0.0 | 0.0 | 0.0 | Fof Allocate | 0.0 | 0.0 | 0 | 0 |
| Vr Copying Group Information Back | 0.0 | 0.0 | 0.0 | Vr Copying Group Information Back | 0.0 | 0.0 | 0 | 0 |
| Vr Invokation Of Velociraptor | 0.0 | 0.0 | 0.0 | Vr Invokation Of Velociraptor | 0.0 | 0.0 | 0 | 0 |
| Engine Launch Timestep | 0.0 | 0.0 | 0.0 | Engine Launch Timestep | 0.0 | 0.0 | 0 | 0 |
| Vr Collecting Top-Level Cell Info | 0.0 | 0.0 | 0.0 | Vr Collecting Top-Level Cell Info | 0.0 | 0.0 | 0 | 0 |
| Engine Launch Fof | 0.0 | 0.0 | 0.0 | Engine Launch Fof | 0.0 | 0.0 | 0 | 0 |
| Engine Estimate Nr Tasks | 0.0 | 0.0 | 0.0 | Engine Estimate Nr Tasks | 0.0 | 0.0 | 0 | 0 |
| Engine Unskip Timestep Communications | 0.0 | 0.0 | 0.0 | Engine Unskip Timestep Communications | 0.0 | 0.0 | 0 | 0 |
| Engine Repartition | 0.0 | 0.0 | 0.0 | Engine Repartition | 0.0 | 0.0 | 0 | 0 |
| Writing Particle Properties | 0.0 | 0.0 | 0.0 | Writing Particle Properties | 0.0 | 0.0 | 0 | 0 |
| Vr Collecting Particle Info | 0.0 | 0.0 | 0.0 | Vr Collecting Particle Info | 0.0 | 0.0 | 0 | 0 |
| Engine Launch Fof Comm | 0.0 | 0.0 | 0.0 | Engine Launch Fof Comm | 0.0 | 0.0 | 0 | 0 |
| | | 26.8051 | | Space Split | | 1.9301 | | 28 |
| | | 0.0018 | | Starting Mesh Communication | | 0.0001 | | 28 |
| | | 0.0002 | | Waiting For Mesh Communication | | 0.0 | | 28 |
Thanks! Glad to see some improvement.
In any case, the communication here is much larger than the rebuild time, so we don't have enough work to hide it; the benefit is therefore quite small. Overall it did not seem to slow down the rebuild (now `space_rebuild` + `space_split`), so I guess it can happily happen in the background.
`Updating Particle Counts` is really a measure of the imbalance in the rebuild, as it is the first global barrier after the rebuild has taken place. Maybe it's a sign that we slowed down all the communications in the rebuild because of the big all-reduce of the mesh? The same is true of `Engine Recompute Displacement Constraint`: the work in there is minimal but requires an all-reduce, so it is likely more a sign of imbalance in what comes before (i.e. the rebuild).

That is the first synchronous MPI call, so it will have an equivalent of `MPI_Wait` inside, which as we know has the side-effect of progressing any asynchronous MPI work. Effectively we move the work from earlier to there. My suspicion is that the other MPI-calling functions are also taking some of this work.
To test these ideas I've hacked the code to include some `MPI_Test` calls; I expect that will show a different pattern in how the work is distributed. It might go away from this analysis, i.e. not show up in a timed function, but still take the same runtime. We'll see.
Edited by Peter W. Draper

OK, so with a number of `MPI_Test` calls I do see some differences:
| function | test_time | branch_time | time_diff |
|---|---:|---:|---:|
| Updating Particle Counts | 59.3705 | 122.843 | -63.4725 |
| Engine Recompute Displacement Constraint | 37.8639 | 82.298 | -44.4341 |
| Engine Exchange Top Multipoles | 192.5501 | 205.3428 | -12.7927 |
but the time taken is roughly the same, so, as I suspected, we need to progress the call anyway; that is easy to hide in the engine, but less so here. Given that the original took 400s, the actual improvement is slight. Hah well. I suppose we could hope that smarter network cards will make this faster, but it is far from clear whether an asynchronous approach would then be the best one.
Agreed. I tried to speed this up with a dedicated thread, but that broke: it seems you cannot have an `MPI_Allreduce` waiting while others start in other threads; they just started accepting each other's requests. I also tried an asynchronous call with polling via `MPI_Test`, which at least worked, but gave the same result in terms of speed-up, so there is no way to hide this work.

Ok. I'll cherry-pick the relevant commits into a separate branch. As for the communication, maybe the environment variable `I_MPI_ADJUST_ALLREDUCE=4` that @jch identified as beneficial on Curie could help on cosma as well.

mentioned in merge request !1015 (merged)
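For the record, a sketch of how that would be set for a job. `I_MPI_ADJUST_ALLREDUCE` is Intel-MPI-specific and selects which all-reduce algorithm the library uses; the exact value-to-algorithm mapping is in the Intel MPI tuning documentation. The run line below is purely illustrative.

```shell
# Intel MPI: force a specific MPI_Allreduce algorithm for the whole job
# (see the Intel MPI reference for what each value selects).
export I_MPI_ADJUST_ALLREDUCE=4

# Illustrative run line; binary name and flags as for a normal run.
mpirun -np 32 ./swift_mpi --cosmology --self-gravity ...
```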