Add ParMETIS support
Adds ParMETIS repartitioning, so that the new partition of the cell graph is calculated across all the MPI nodes.
ParMETIS also has methods that refine an existing solution, not just create a new one, which can be used to reduce the amount of particle movement.
As part of this work we are reducing the number of repartitioning techniques offered, dropping those that tend to give marginally worse solutions in tests. The remaining ones are "costs/costs", "costs/none", "none/costs" and "costs/time", that is: balanced vertex and edge weights, vertex weights only, edge weights only, and edges weighted by the expected time of the next updates.
Initially this work intended to remove support for METIS, but testing the actual runtimes shows that METIS still produces the best balance, so we are keeping it as a continuing option, now enhanced by the ParMETIS options. ParMETIS may prove to be a better choice at scales larger than our current simulations.
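For context, the adaptive path uses ParMETIS's `ParMETIS_V3_AdaptiveRepart` routine, which starts from the current partition and trades the edge-cut against the cost of moving vertices. The wrapper below is only a minimal sketch of what such a call looks like, not SWIFT's implementation: the function name, the uniform `vsize`, the 5% imbalance tolerance and the default options are all illustrative.

```c
/* Sketch only: the shape of an adaptive repartition call with ParMETIS.
 * Array names, tolerances and error handling are illustrative, not SWIFT's. */
#include <stdlib.h>
#include <mpi.h>
#include <parmetis.h>

/* Repartition a distributed cell graph in CSR form, starting from the
 * current assignment in part[] so that ParMETIS can trade edge-cut
 * against the cost of moving cells (and hence particles). */
void repart_adaptive(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt,
                     idx_t *adjwgt, idx_t nparts, real_t itr, idx_t *part,
                     MPI_Comm comm) {

  idx_t wgtflag = 3;            /* Both vertex and edge weights supplied. */
  idx_t numflag = 0;            /* C-style, zero-based numbering. */
  idx_t ncon = 1;               /* One balance constraint per vertex. */
  idx_t options[3] = {0, 0, 0}; /* Defaults; far fewer knobs than METIS. */
  idx_t edgecut = 0;

  /* Equal target weight for each partition and a 5% imbalance tolerance. */
  real_t *tpwgts = malloc(nparts * sizeof(real_t));
  for (idx_t i = 0; i < nparts; i++) tpwgts[i] = 1.0 / (real_t)nparts;
  real_t ubvec = 1.05;

  /* vsize is the cost of moving each vertex; NULL means uniform. The itr
   * parameter sets the ratio of redistribution cost to edge-cut: small
   * values favour keeping cells (and their particles) where they are. */
  if (ParMETIS_V3_AdaptiveRepart(vtxdist, xadj, adjncy, vwgt, /*vsize=*/NULL,
                                 adjwgt, &wgtflag, &numflag, &ncon, &nparts,
                                 tpwgts, &ubvec, &itr, options, &edgecut,
                                 part, &comm) != METIS_OK)
    MPI_Abort(comm, 1);

  free(tpwgts);
}
```

When building a partition from scratch rather than refining one, the analogous routine is `ParMETIS_V3_PartKway`; the commits below make the choice of refinement mode and the `itr` value configurable.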
Other significant updates in this request:
- Initial partitioning schemes given more obvious names: grid, region, memory or vectorized. Region balances by volume and memory by particle distribution; the others are only interesting when (Par)METIS is not available.
- Weights are now calculated using floats and the sum is scaled into the range of `idx_t`. That should avoid integer overflow issues (a sketch of the scaling idea follows this list).
- Weights from MPI tasks are no longer used; the position of these is strictly a free parameter of the solution.
- The balance of weights between vertices and timebins is defined as an equipartition. Previously the limits were matched.
- New clocks function `clocks_random_seed()` to return "random" seeds based on the remainder of the current number of nanoseconds (sketched below).
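The weight scaling can be pictured as below; this is a minimal sketch assuming the raw weights are doubles and that half of `IDX_MAX` is a safe target for the total. The function name and margin are illustrative, not SWIFT's code.

```c
/* Sketch only: rescale floating-point weights so that their sum fits
 * comfortably within METIS's idx_t range; the half-of-IDX_MAX margin
 * is an illustrative choice. */
#include <metis.h>

void scale_weights_to_idx(int nweights, const double *w, idx_t *out) {

  /* Sum the raw weights in double precision. */
  double sum = 0.0;
  for (int i = 0; i < nweights; i++) sum += w[i];

  /* Scale factor that maps the total onto half of the idx_t range. */
  const double scale = (sum > 0.0) ? 0.5 * (double)IDX_MAX / sum : 1.0;

  for (int i = 0; i < nweights; i++) {
    /* Clamp to at least 1 so no vertex or edge ends up weightless. */
    const idx_t v = (idx_t)(w[i] * scale);
    out[i] = (v > 0) ? v : 1;
  }
}
```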
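And a rough sketch of the kind of nanosecond-based seed described in the last bullet; the real `clocks_random_seed()` may differ in the modulus and return type.

```c
/* Rough sketch of a nanosecond-based "random" seed in the spirit of
 * clocks_random_seed(); the modulus and return type are assumptions. */
#include <stdlib.h>
#include <time.h>

unsigned int random_seed_sketch(void) {

  struct timespec ts;
  clock_gettime(CLOCK_REALTIME, &ts);

  /* The nanosecond remainder changes rapidly from call to call, so it
   * gives a cheap seed that differs between invocations and ranks. */
  return (unsigned int)(ts.tv_nsec % RAND_MAX);
}
```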
Merge request reports
Activity
mentioned in merge request !503 (closed)
added 1 commit
- c69715c8 - Revert "Use threadpool to speed up particle loops"
added 1 commit
- 624012ee - Undo debugging change that always reports numbers of particles moved
added 89 commits
- a9b78167...079f0ea3 - 88 commits from branch master
- 986ab6d1 - Merge remote-tracking branch 'origin/master' into parmetis-perm
Back to WIP, as the balance in longer runs isn't as good as master. More investigation is required, but the most likely culprit is the initial partition, since this difference can be seen in the first steps. It isn't possible to get the same partition as serial METIS as the options[] array tuning isn't available in ParMETIS.
added 115 commits
- 3c383661...a3866435 - 114 commits from branch master
- 34aa1a37 - Merge remote-tracking branch 'origin/master' into parmetis-perm
added 59 commits
- 9dc16374...a727dda0 - 56 commits from branch master
- 1a9680cb - Change balanced initial partition to memory, should be more obvious
- 346b3b77 - Use METIS for the initial partition and when not 'refining', this gives a better global solution
- af9cb136 - Merge remote-tracking branch 'origin/master' into parmetis-perm
added 1 commit
- 7fc8e370 - Make the itr value a parameter, it will almost certainly benefit from tuning
added 40 commits
- 7fc8e370...07f1e5b1 - 39 commits from branch master
- 4a8fc023 - Merge remote-tracking branch 'origin/master' into parmetis-perm
Still looking into why the balance using this scheme is not as good; here are some plots that may give a clue. These use my old technique of dumping the cells at the end of each step, extracting those that are top-level and active, and working out which are on the edges of the domains, i.e. will be doing MPI. The y-axis is just the step and the x-axis the ratio of active cells to cells; points are coloured by how many active cells are in the step and there is a histogram showing the density of points (not always visible). For a run of EAGLE_50 on 8 nodes for 8192 steps that is forced to repartition after 1000 steps we get the following, first for master:
Now this branch:
added 17 commits
- 8a463288...7b0aa8f0 - 16 commits from branch master
- e7b09db5 - Merge remote-tracking branch 'origin/master' into parmetis-perm
added 1 commit
- 14671b49 - Generalise process_cells for any number of ranks
added 692 commits
- d7c24ae2...e95331dd - 690 commits from branch master
- 4a18404b - Merge remote-tracking branch 'origin/master' into parmetis-perm
- dc8a6050 - Formatting
added 48 commits
- 67e9b219...e9d05916 - 47 commits from branch master
- de28106f - Merge branch 'master' into parmetis-perm
added 1 commit
- 3717d3ad - Bring back METIS as a configure option and allow adaptive and normal refinement with ParMETIS
added 4 commits
- 551894c3 - Merge remote-tracking branch 'origin/master' into parmetis-perm
- bbefa289 - Fix the sum of weights sent to PARMETIS/METIS so that we avoid overflows
- 1a44222f - Merge remote-tracking branch 'origin/master' into parmetis-perm-test
- 0aa63048 - Merge in changes that limit sum of weights to less than METIS idx_t range
added 1 commit
- f3a06c4b - Make use of METIS an option when ParMETIS is configured
added 1 commit
- b0258bb5 - Need pick_metis() when only have ParMETIS and usemetis is true
added 1 commit
- 67bd6d88 - Keep timebin weights in some balance with vertex weights
added 1 commit
- 5a2a6074 - Add function to give a different seed for randoms
added 1 commit
- 2dcd0dff - Iterate a few solutions using ParMETIS when not refining
added 188 commits
- 2dcd0dff...4669036f - 187 commits from branch master
- 2c149381 - Merge remote-tracking branch 'origin/master' into parmetis-perm
added 1 commit
- 1358ac9a - Fix bug sending regionids to other ranks and don't use initial partition as basis for refinement
After a lot of effort here is a plot of the final "solution". First without a log Y axis, so the range is reduced from the full data:
Now with the usual log Y axis and all the data:
The result is that, although we see gains from not moving particles around so much, they are very small compared to the variations we see in the overall balance, so the three schemes (METIS, ParMETIS adaptive, ParMETIS refine) are pretty much the same within the measurement error. The plan is now to offer ParMETIS as an option for testing as we move to larger simulations (these measurements were done using EAGLE_50 on 8 nodes of COSMA6, SPH only, as MPI gravity isn't available at present).
added 112 commits
- 45c63d97...50457d68 - 111 commits from branch master
- 3e6f8b0c - Merge remote-tracking branch 'origin/master' into parmetis-perm
For interest I've been running some jobs and looking for indications about where our balance is, and how that is affecting the runtime, comparing master (simple METIS) and this code (hybrid ParMETIS/METIS). I haven't managed to draw any useful conclusions, but the results are interesting enough to share. These are from two runs of 3 hours on the same 8 nodes of COSMA6, SPH only, EAGLE_50.
First the comparison of the simulation time achieved, plotted as steps versus internal time:
Surprisingly this code performs more steps and proceeds further into the simulation.
We can see this if we plot the cumulative time in steps against the simulation time.
So the speed up comes early and stays with us.
Now a favourite plot, number of updated particles versus the time per step:
This gives some indication of why that is. Clearly the shortest time steps are running faster, but when?
It seems at all times, but this plot also shows that we do better some of the time at other step loci as well.
Burrowing deeper, for these runs I also output the real CPU time (user) per step as a min, max and mean across the ranks. This allows us to ask how well our scaling is for these different timesteps.
CPU estimate plots: mean, max and min across the ranks, for ParMETIS/METIS and METIS.
So here we have simulation time against the ratio of CPU time to time per step. Note that the CPU time is divided by the number of engine threads, 16 in this case; that is only approximate as we have a threadpool of 16 as well, but these two thread groups rarely operate at the same time (the only instance is when initially queuing tasks). The size of the circle is a function of the number of particle updates in the step, so as we expect the large circles are more efficient. I think you can see that the ParMETIS run is more efficient at small and intermediate steps, but you need to look closely.
The really interesting plots are the min/max ones. These directly show the imbalance between the nodes. Maybe we should be doing better than this, especially for the larger steps.
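For reference, the plotted quantity is formed roughly as below; this is a hand-rolled illustration of the arithmetic described above, not the actual analysis script.

```c
/* Illustration of the plotted ratio: per-step user CPU time, divided by
 * the number of engine threads, as a fraction of the step's wall-clock
 * time. 1.0 means every thread was busy for the whole step; values near
 * 0 mean the rank was mostly idle. */
double step_cpu_efficiency(double cpu_user_seconds, int nr_threads,
                           double step_wallclock_seconds) {
  return (cpu_user_seconds / (double)nr_threads) / step_wallclock_seconds;
}
```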
added 122 commits
- 3e6f8b0c...79c022ee - 119 commits from branch master
- 45a713e9 - Merge remote-tracking branch 'origin/master' into parmetis-perm
- 9b6b6704 - Merge remote-tracking branch 'origin/master' into parmetis-perm
- e8c760fe - Still need celllist when using plain METIS
Thanks for the very detailed analysis.
I have been running some EAGLE-50 examples with gravity over MPI and I find that the time for the small steps is about 500ms, which is about 20x worse than in EAGLE_25. I look forward to trying out the ParMETIS version and using it to recalibrate the gravity task weights.
@nnrw56 you will like these plots.
@matthieu, yes, I do! Although I'm not sure I understand all of them.
So the main message is that ParMETIS now does better than METIS, and this is mainly due to the smaller timesteps? If so, then that is excellent news, and means that it is working as expected!
I'm not quite sure I understand the final set of plots, though I probably would get them framed and hung in my office :) It looks like, on average, we're getting better CPU usage for ParMETIS, which means we're probably waiting less on send/recv tasks, which is consistent with my interpretation of the first two plots. Is that correct?
@pdraper, do you have a plot, or at least the data to make one, showing the actual amount of data sent back and forth for each time step? My guess is that we should see ParMETIS sending fewer particles than METIS, because it builds and maintains a better partition, which is the main source of improvement for the other results.
In any case, awesome stuff!
Yes, ParMETIS seems to do better overall.
We could possibly improve things on the smallest steps:
Our interpretation is that it is related to the fact that on the smallest steps we still try to run with 16 threads even if there is only one cell to update and hence only one chain of dependencies. Looking at task plots we see that on small steps multiple threads run these tasks, so the minimum time is not 0 and the max is not 1, whilst it could be. We could consider unleashing only 1 thread on the task graph when there is little to do. That would reduce the contention between threads, improve cache re-use and allow the one and only running core to clock up.
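Purely as a sketch of that idea (none of these names or the threshold exist in the code), the engine could cap how many workers it wakes when a step has very little queued:

```c
/* Hypothetical sketch: wake a single worker thread when a step has very
 * little to do, cutting queue contention, keeping caches warm and letting
 * the one busy core boost its clock. Names and threshold are invented. */
int threads_to_wake(int nr_queued_tasks, int nr_engine_threads) {

  /* Below this much work, extra threads mostly just fight over the queues. */
  const int small_step_threshold = 32;

  return (nr_queued_tasks < small_step_threshold) ? 1 : nr_engine_threads;
}
```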