
Mesh gravity speed-ups

Closed Matthieu Schaller requested to merge parallel_mesh into master

Implements two improvements:

  • Use the threadpool to apply the Green function in the PM part of the code
  • Use an asynchronous all-reduce to communicate the mesh across the MPI ranks.

To implement the second part, I have removed the call to space_split() that was in space_rebuild(). The space_split() is now called after the communication has been initiated.


Activity

  • For reference, here are the results that prompted these changes:

    So here is a more interesting badly behaved run. It is EAGLE_50/127 with a mesh size of 1200, run on 32 ranks on 16 nodes:

    Time spent in the different code sections:
     - 'Engine Launch                           ' (   53 calls, time: 1156.8316s): 30.7219%
     - 'Mesh Comunication                       ' (   53 calls, time: 768.0140s): 20.3961%
     - 'Green Function                          ' (   53 calls, time: 287.8016s): 7.6431%
     - 'Forward Fourier Transform               ' (   53 calls, time: 200.7225s): 5.3306%
     - 'Backwards Fourier Transform             ' (   53 calls, time: 196.3886s): 5.2155%
     - 'Space Rebuild                           ' (   53 calls, time: 160.4167s): 4.2602%
     - 'Exchanging Cell Tags                    ' (   53 calls, time: 140.5005s): 3.7313%
     - 'Engine Repartition                      ' (   21 calls, time: 137.7499s): 3.6582%
     - 'Gpart Assignment                        ' (   53 calls, time: 109.1116s): 2.8977%
     - 'Engine Marktasks                        ' (   53 calls, time: 103.5276s): 2.7494%
     - 'Engine Exchange Cells                   ' (   53 calls, time: 79.4477s): 2.1099%
     - 'Updating Particle Counts                ' (   53 calls, time: 57.4324s): 1.5252%
     - 'Reading Initial Conditions              ' (    1 calls, time: 56.5060s): 1.5006%
     - 'Engine Recompute Displacement Constraint' (   53 calls, time: 50.7576s): 1.3480%
     - 'Engine Collect End Of Step              ' (   52 calls, time: 32.5977s): 0.8657%

    51 steps, with a repartition or rebuild each step... BTW, options used are: --cooling --star-formation --feedback --stars --cosmology --hydro --self-gravity

  • added 1 commit

    • d77797f3 - Compilation fixes for ICC and GCC on cosma.

    Compare with previous version

  • Peter W. Draper added 74 commits

    Compare with previous version

  • added 1 commit

    • 630987ff - Fix spelling of communication

    Compare with previous version

  • Had a look at this and it is working so far, but in my simple tests I don't see much sign of a speed up. Will try the same example as above next.

  • Thanks for checking. Indeed I would not expect much benefit in the normal scenarios we currently run.

    Only for bigger grids like your test or John's really big runs.

  • That seems more promising. The elapsed time is about 10% faster (1556s to 1388s). Here is a comparison of the analysis from master and this branch. The Green function is clearly faster, but whether the iallreduce is faster is less clear. I suspect we just set that away and the full costs are being paid at the next MPI calls, hence the extra time in updating particle counts and exchanging multipoles.

    +------------------------------------------+-------------+-------------+-----------------------+------------------------------------------+-------------+-------------+---------------+---------------+
    | master_function                          | master_time | branch_time | time_diff             | branch_function                          | master_perc | branch_perc | master_ncalls | branch_ncalls |
    +------------------------------------------+-------------+-------------+-----------------------+------------------------------------------+-------------+-------------+---------------+---------------+
    | Engine Launch Task                       | 479.0461    | 478.8341    | 0.21200000000004593   | Engine Launch Task                       | 30.785      | 34.4783     | 53            | 53            |
    | Mesh Comunication                        | 408.56      |             |                       |                                          | 26.2554     |             | 28            |               |
    | Green Function                           | 154.9366    | 19.2588     | 135.6778              | Green Function                           | 9.9567      | 1.3867      | 28            | 28            |
    | Forward Fourier Transform                | 108.4265    | 108.6355    | -0.20899999999998897  | Forward Fourier Transform                | 6.9678      | 7.8223      | 28            | 28            |
    | Backwards Fourier Transform              | 105.781     | 106.2935    | -0.5124999999999886   | Backwards Fourier Transform              | 6.7978      | 7.6536      | 28            | 28            |
    | Gpart Assignment                         | 58.4658     | 58.8546     | -0.38879999999999626  | Gpart Assignment                         | 3.7572      | 4.2378      | 28            | 28            |
    | Space Rebuild                            | 40.2648     | 13.4273     | 26.8375               | Space Rebuild                            | 2.5875      | 0.9668      | 28            | 28            |
    | Engine Recompute Displacement Constraint | 25.3707     | 82.298      | -56.9273              | Engine Recompute Displacement Constraint | 1.6304      | 5.9258      | 28            | 28            |
    | Updating Particle Counts                 | 15.1419     | 122.843     | -107.7011             | Updating Particle Counts                 | 0.9731      | 8.8453      | 28            | 28            |
    | Engine Exchange Cells                    | 15.0863     | 20.9501     | -5.8637999999999995   | Engine Exchange Cells                    | 0.9695      | 1.5085      | 28            | 28            |
    | Engine Unskip                            | 14.8753     | 15.0331     | -0.15779999999999994  | Engine Unskip                            | 0.9559      | 1.0825      | 25            | 25            |
    | Dumping Restart Files                    | 9.8315      | 8.7476      | 1.0838999999999999    | Dumping Restart Files                    | 0.6318      | 0.6299      | 1             | 1             |
    | Engine Drift All                         | 9.4158      | 9.2831      | 0.1327000000000016    | Engine Drift All                         | 0.6051      | 0.6684      | 28            | 28            |
    | Reading Initial Conditions               | 7.9241      | 8.3961      | -0.4720000000000004   | Reading Initial Conditions               | 0.5092      | 0.6046      | 1             | 1             |
    | Engine Split Gas Particles               | 6.0761      | 6.0528      | 0.023299999999999876  | Engine Split Gas Particles               | 0.3905      | 0.4358      | 25            | 25            |
    | Exchanging Cell Tags                     | 5.0707      | 6.2089      | -1.1381999999999994   | Exchanging Cell Tags                     | 0.3259      | 0.4471      | 28            | 28            |
    | Engine Split                             | 4.307       | 4.2913      | 0.015700000000000713  | Engine Split                             | 0.2768      | 0.309       | 1             | 1             |
    | Space Init                               | 3.3497      | 3.2974      | 0.05229999999999979   | Space Init                               | 0.2153      | 0.2374      | 1             | 1             |
    | Recursively Linking Foreign Arrays       | 2.8971      | 2.8864      | 0.010699999999999932  | Recursively Linking Foreign Arrays       | 0.1862      | 0.2078      | 28            | 28            |
    | Engine Collect End Of Step               | 2.0745      | 3.2632      | -1.1886999999999999   | Engine Collect End Of Step               | 0.1333      | 0.235       | 52            | 52            |
    | Engine Marktasks                         | 1.8917      | 1.7128      | 0.17889999999999984   | Engine Marktasks                         | 0.1216      | 0.1233      | 28            | 28            |
    | Making Extra Hydroloop Tasks             | 1.021       | 0.9933      | 0.027699999999999947  | Making Extra Hydroloop Tasks             | 0.0656      | 0.0715      | 28            | 28            |
    | Communicating Rebuild Flag               | 0.6106      | 0.6082      | 0.0024000000000000687 | Communicating Rebuild Flag               | 0.0392      | 0.0438      | 51            | 51            |
    | Engine Print Stats                       | 0.488       | 0.3971      | 0.09089999999999998   | Engine Print Stats                       | 0.0314      | 0.0286      | 2             | 2             |
    | Engine Exchange Top Multipoles           | 0.4707      | 205.3428    | -204.87210000000002   | Engine Exchange Top Multipoles           | 0.0302      | 14.7856     | 28            | 28            |
    | Setting Super-Pointers                   | 0.4387      | 0.4055      | 0.03319999999999995   | Setting Super-Pointers                   | 0.0282      | 0.0292      | 28            | 28            |
    | Scheduler Reweight                       | 0.4318      | 0.4474      | -0.015600000000000003 | Scheduler Reweight                       | 0.0277      | 0.0322      | 28            | 28            |
    | Ranking The Tasks                        | 0.3568      | 0.3686      | -0.011799999999999977 | Ranking The Tasks                        | 0.0229      | 0.0265      | 28            | 28            |
    | Creating Recv Tasks                      | 0.3473      | 0.3334      | 0.013900000000000023  | Creating Recv Tasks                      | 0.0223      | 0.024       | 28            | 28            |
    | Counting And Linking Tasks               | 0.2676      | 0.2684      | -8.000000000000229E-4 | Counting And Linking Tasks               | 0.0172      | 0.0193      | 28            | 28            |
    | Setting Unlocks                          | 0.2451      | 0.2459      | -7.999999999999952E-4 | Setting Unlocks                          | 0.0158      | 0.0177      | 28            | 28            |
    | Making Gravity Tasks                     | 0.2363      | 0.2003      | 0.036000000000000004  | Making Gravity Tasks                     | 0.0152      | 0.0144      | 28            | 28            |
    | Engine Init                              | 0.2269      | 0.2238      | 0.0030999999999999917 | Engine Init                              | 0.0146      | 0.0161      | 1             | 1             |
    | Linking Gravity Tasks                    | 0.2002      | 0.2062      | -0.006000000000000005 | Linking Gravity Tasks                    | 0.0129      | 0.0148      | 28            | 28            |
    | Creating Send Tasks                      | 0.1443      | 0.1558      | -0.011499999999999982 | Creating Send Tasks                      | 0.0093      | 0.0112      | 28            | 28            |
    | Making Hydro Tasks                       | 0.1226      | 0.1219      | 7.000000000000062E-4  | Making Hydro Tasks                       | 0.0079      | 0.0088      | 28            | 28            |
    | Counting Number Of Foreign Particles     | 0.1106      | 0.1086      | 0.0020000000000000018 | Counting Number Of Foreign Particles     | 0.0071      | 0.0078      | 28            | 28            |
    | Space List Useful Top Level Cells        | 0.0682      | 0.0676      | 6.000000000000033E-4  | Space List Useful Top Level Cells        | 0.0044      | 0.0049      | 28            | 28            |
    | Engine Print Task Counts                 | 0.0342      | 0.04        | -0.0058               | Engine Print Task Counts                 | 0.0022      | 0.0029      | 81            | 81            |
    | Splitting Tasks                          | 0.0199      | 0.0198      | 9.99999999999994E-5   | Splitting Tasks                          | 0.0013      | 0.0014      | 28            | 28            |
    | Engine Drift Top Multipoles              | 0.0165      | 0.0149      | 0.0016000000000000007 | Engine Drift Top Multipoles              | 0.0011      | 0.0011      | 25            | 25            |
    | Engine Repartition Trigger               | 0.003       | 0.0032      | -2.000000000000001E-4 | Engine Repartition Trigger               | 2.0E-4      | 2.0E-4      | 51            | 51            |
    | Fof Search Tree                          | 0.0         | 0.0         | 0.0                   | Fof Search Tree                          | 0.0         | 0.0         | 0             | 0             |
    | Engine Activate Fof Tasks                | 0.0         | 0.0         | 0.0                   | Engine Activate Fof Tasks                | 0.0         | 0.0         | 0             | 0             |
    | Engine Make Fof Tasks                    | 0.0         | 0.0         | 0.0                   | Engine Make Fof Tasks                    | 0.0         | 0.0         | 0             | 0             |
    | Fof Allocate                             | 0.0         | 0.0         | 0.0                   | Fof Allocate                             | 0.0         | 0.0         | 0             | 0             |
    | Vr Copying Group Information Back        | 0.0         | 0.0         | 0.0                   | Vr Copying Group Information Back        | 0.0         | 0.0         | 0             | 0             |
    | Vr Invokation Of Velociraptor            | 0.0         | 0.0         | 0.0                   | Vr Invokation Of Velociraptor            | 0.0         | 0.0         | 0             | 0             |
    | Engine Launch Timestep                   | 0.0         | 0.0         | 0.0                   | Engine Launch Timestep                   | 0.0         | 0.0         | 0             | 0             |
    | Vr Collecting Top-Level Cell Info        | 0.0         | 0.0         | 0.0                   | Vr Collecting Top-Level Cell Info        | 0.0         | 0.0         | 0             | 0             |
    | Engine Launch Fof)                       | 0.0         | 0.0         | 0.0                   | Engine Launch Fof)                       | 0.0         | 0.0         | 0             | 0             |
    | Engine Estimate Nr Tasks                 | 0.0         | 0.0         | 0.0                   | Engine Estimate Nr Tasks                 | 0.0         | 0.0         | 0             | 0             |
    | Engine Unskip Timestep Communications    | 0.0         | 0.0         | 0.0                   | Engine Unskip Timestep Communications    | 0.0         | 0.0         | 0             | 0             |
    | Engine Repartition                       | 0.0         | 0.0         | 0.0                   | Engine Repartition                       | 0.0         | 0.0         | 0             | 0             |
    | Writing Particle Properties              | 0.0         | 0.0         | 0.0                   | Writing Particle Properties              | 0.0         | 0.0         | 0             | 0             |
    | Vr Collecting Particle Info              | 0.0         | 0.0         | 0.0                   | Vr Collecting Particle Info              | 0.0         | 0.0         | 0             | 0             |
    | Engine Launch Fof Comm                   | 0.0         | 0.0         | 0.0                   | Engine Launch Fof Comm                   | 0.0         | 0.0         | 0             | 0             |
    |                                          |             | 26.8051     |                       | Space Split                              |             | 1.9301      |               | 28            |
    |                                          |             | 0.0018      |                       | Starting Mesh Communication              |             | 1.0E-4      |               | 28            |
    |                                          |             | 2.0E-4      |                       | Waiting For Mesh Communication           |             | 0.0         |               | 28            |
    +------------------------------------------+-------------+-------------+-----------------------+------------------------------------------+-------------+-------------+---------------+---------------+
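The headline numbers in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only timings copied from the table; the function names are illustrative. It gives an elapsed-time reduction of about 10.8%, a Green function speed-up of about 8x, and shows that roughly 370 s of the 408.56 s formerly spent in mesh communication reappears in the three sections that contain the next MPI calls, leaving a net gain of about 39 s on the communication side.

```c
#include <stddef.h>

/* Timings in seconds, copied from the comparison table.
 * Order: Updating Particle Counts,
 *        Engine Recompute Displacement Constraint,
 *        Engine Exchange Top Multipoles. */
static const double master_sync[3] = {15.1419, 25.3707, 0.4707};
static const double branch_sync[3] = {122.843, 82.298, 205.3428};
static const double master_mesh_comm = 408.56;

/* Percentage by which `after` is shorter than `before`. */
double percent_faster(double before, double after) {
  return 100.0 * (before - after) / before;
}

/* Ratio of the two timings. */
double speedup_factor(double before, double after) { return before / after; }

/* Extra time that shows up in the three sections listed above. */
double shifted_time(void) {
  double total = 0.;
  for (size_t i = 0; i < 3; ++i) total += branch_sync[i] - master_sync[i];
  return total;
}

/* Mesh communication time on master minus the time that moved elsewhere. */
double net_comm_saving(void) { return master_mesh_comm - shifted_time(); }
```

For example, `percent_faster(1556.0, 1388.0)` covers the elapsed time and `speedup_factor(154.9366, 19.2588)` the Green function. The same table also shows Space Rebuild plus Space Split on the branch (13.4273 s + 26.8051 s) matching Space Rebuild on master (40.2648 s) to within a few hundredths of a second.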
  • Thanks! Glad to see some improvement.

    In any case the communication here is much larger than the rebuild time so we don't have enough work to hide it. So the benefit is quite small. Overall it did not seem to slow down the rebuild (now space_rebuild + space_split) so I guess it can happily happen in the background.

    Updating Particle Counts is really a measure of the imbalance in the rebuild as it is the first global barrier after the rebuild has taken place. Maybe it's a sign that we slowed down all the communications in the rebuild because of the big all-reduce of the mesh? The same is true with Engine Recompute Displacement Constraint. The work in there is minimal but requires an all-reduce so likely more a sign of imbalance in what comes before (i.e. the rebuild).

  • Why would Engine Exchange Top Multipoles have increased so much?

  • That is the first synchronous MPI call, so will have an equivalent to MPI_Wait inside, which as we know has the side-effect of progressing any asynchronous MPI work. Effectively we move the work from earlier to there. My suspicion is that the other MPI calling functions are also taking some of this work as well.

    To test these ideas I've hacked the code to include some MPI_Test calls. I expect that will show a different pattern in how the work is distributed. It might disappear from this analysis, i.e. not fall in a timed function, but still take the same runtime. We'll see.

    Edited by Peter W. Draper
  • They could be made asynchronous if need be. But indeed, if, as you suspect, we need to ping the library to make the main call progress, then keeping them as they are might be useful.

  • OK, so with a number of MPI_Test calls I do see some differences:

    +------------------------------------------+-----------+-------------+------------------------+
    | function                                 | test_time | branch_time | time_diff              |
    +------------------------------------------+-----------+-------------+------------------------+
    | Updating Particle Counts                 | 59.3705   | 122.843     | -63.472500000000004    |
    | Engine Recompute Displacement Constraint | 37.8639   | 82.298      | -44.4341               |
    | Engine Exchange Top Multipoles           | 192.5501  | 205.3428    | -12.792700000000025    |
    +------------------------------------------+-----------+-------------+------------------------+

    but the time taken is roughly the same. So, as I suspected, we need to progress the call anyway, which is easy to hide in the engine but less so here. Given that the original took 400s, the actual improvement is slight. Hah well. I suppose we could hope that smarter cards will make this faster, but it is far from clear whether that would be best thought of asynchronously or not.

  • Should we then keep only the first three commits? The Green function speed-up is still useful.

  • Agreed. I tried to speed this up with a dedicated thread, but that broke: it seems you cannot have an MPI_Allreduce waiting while others start in other threads, as they just started accepting each other's requests. I also tried an asynchronous call with polling via MPI_Test, which at least worked, but gave the same result in terms of speed-up, so there is no way to hide this work.
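The conclusion of this exchange can be pictured with a toy model. This is not MPI code and not what was run: `fake_request`, `fake_test` and `overlap_with_work` are made-up names, and the assumption (taken from the comments above) is that the non-blocking operation only advances when the library is called. Polling between chunks of useful work hides the operation only if there are enough chunks; whatever is left must be paid in a blocking wait.

```c
#include <assert.h>

/* Hypothetical non-blocking operation that needs `polls_needed` calls into
 * the library before it completes (it makes no progress on its own). */
struct fake_request {
  int polls_needed;
  int polls_done;
};

/* Plays the role of MPI_Test: advance by one step, return 1 when complete. */
int fake_test(struct fake_request *req) {
  if (req->polls_done < req->polls_needed) req->polls_done++;
  return req->polls_done >= req->polls_needed;
}

/* Interleave `n_work_chunks` of useful work with one poll each, then finish
 * with a blocking loop (the role of MPI_Wait). Returns the number of polls
 * that could not be hidden behind the work. */
int overlap_with_work(struct fake_request *req, int n_work_chunks) {
  assert(req->polls_needed >= 0);
  int done = (req->polls_done >= req->polls_needed);
  for (int c = 0; c < n_work_chunks; ++c) {
    /* ... one chunk of useful work, e.g. part of the rebuild ... */
    if (!done) done = fake_test(req);
  }
  int exposed_polls = 0;
  while (!done) {
    done = fake_test(req);
    exposed_polls++;
  }
  return exposed_polls;
}
```

With a request needing 10 polls and a single chunk of work, 9 polls remain exposed; with 20 chunks none do. In the run discussed here the communication (about 400 s) dwarfs the rebuild (about 40 s), which corresponds to the first case.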

  • Ok. I'll cherry-pick the relevant commits into a separate branch. As for the communication, maybe the environment variable I_MPI_ADJUST_ALLREDUCE=4 that @jch identified as beneficial on Curie could help on cosma as well.

  • Matthieu Schaller mentioned in merge request !1015 (merged)

  • Got a job queued to see what I_MPI_ADJUST_ALLREDUCE=4 does.

  • I think that doubling the time taken for 'Mesh Comunication ' (sic) from 408s to 806s isn't an effect we're after. Sure that value was 4?

    Admittedly only using 32 ranks and EAGLE_50, so I guess this might not gain from the topology feature!

  • Indeed not a good improvement. Maybe the topology of the network is different on Curie? John was also using a lot more nodes for the same size grid (1248^3), so maybe mode 4 is more efficient at higher rank counts?
