Only repartition when required
Only repartition when the previous step processed some large fraction of all the particles, and then only when the loads between the ranks are out of balance. This is for several reasons:
- Repartitioning is expensive, so should only be done when necessary.
- Frequent repartitioning with multi-dt is not necessary (for the EAGLE volumes anyway).
- It is more representative to check the load balance when all tasks have been run.
The load balance is determined from the user CPU time per step (including the CPU time from all threads). We exclude the system time as that is not down to processing and tends to even out the ranks artificially, much as elapsed time does (since we wait for all the MPI tasks to come together).
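As a minimal sketch of the measurement described above (not the code in this branch), the user CPU time of a process can be sampled with the POSIX `getrusage` call. With `RUSAGE_SELF` the figure covers all threads of the calling process, and the system time in `ru_stime` is simply not read. The function name `user_cputime_seconds` is made up for illustration.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: return the user CPU time consumed so far by this
 * process, in seconds, or -1.0 if the query fails.
 * RUSAGE_SELF sums over all threads of the calling process; the system
 * time (ru_stime) is deliberately ignored. */
double user_cputime_seconds(void) {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1.0;
  return (double)usage.ru_utime.tv_sec +
         1e-6 * (double)usage.ru_utime.tv_usec;
}
```

Sampling this at the start and end of a step and taking the difference gives a per-step user CPU time for the rank, which can then be gathered across ranks for comparison.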
The load imbalance allowed is determined by the parameter DomainDecomposition:trigger. This can also be set to a number greater than one, in which case the old repartitioning scheme of every 'trigger' steps will be used (previously trigger was always 100).
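The two meanings of the trigger can be sketched as a single decision function. This is an illustration under stated assumptions, not the logic in this branch: the name `should_repartition`, the arguments `updated_fraction` and `min_fraction`, and the imbalance metric (spread of the per-rank CPU times relative to their mean) are all hypothetical.

```c
/* Hypothetical sketch: decide whether to repartition after a step.
 *
 * trigger          > 1: repartition every 'trigger' steps (old scheme).
 *                  <= 1: allowed load imbalance between ranks.
 * updated_fraction fraction of particles processed in the last step.
 * min_fraction     assumed threshold for "a large fraction".
 * cputimes         per-rank user CPU time for the last step. */
int should_repartition(double trigger, int step, double updated_fraction,
                       double min_fraction, const double *cputimes,
                       int nr_nodes) {
  /* Old scheme: a fixed number of steps between repartitions. */
  if (trigger > 1.0) {
    const int interval = (int)trigger;
    return step > 0 && (step % interval) == 0;
  }

  /* Only judge the balance on steps that touched most particles. */
  if (nr_nodes < 1 || updated_fraction < min_fraction) return 0;

  double min = cputimes[0], max = cputimes[0], sum = 0.0;
  for (int k = 0; k < nr_nodes; k++) {
    if (cputimes[k] < min) min = cputimes[k];
    if (cputimes[k] > max) max = cputimes[k];
    sum += cputimes[k];
  }
  const double mean = sum / nr_nodes;
  if (mean <= 0.0) return 0;

  /* Assumed metric: spread of the CPU times relative to their mean. */
  return (max - min) / mean > trigger;
}
```

For example, with a trigger of 0.05 and four ranks reporting 10, 14, 9 and 11 seconds, the relative spread is 5/11, so a repartition would be requested provided enough particles were updated in that step.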
Activity
Added 4 commits:
Reassigned to @nnrw56
@nnrw56 it's probably time you had a look at this. I probably need to do some more evaluation, but how does it all seem to you?
Added 32 commits:
- 26b99d75...190a201e - 31 commits from branch master
- 6eb3ff13 - Merge remote-tracking branch 'origin/master' into repartition-less
Added 401 commits:
- 6eb3ff13...f9c1d350 - 400 commits from branch master
- d7164853 - Merge remote-tracking branch 'origin/master' into repartition-less
Added 1 commit:
- ceaaa6bf - Fix typo in fractionaltime
Now that we can do longer runs, here are some plots of time per step against step. First using the existing repartitioning scheme:
The repartition steps are the green squares.
Now for the current branch:
and finally my most recent tweak (where we do a repartition after the second step regardless):
Don't get too excited about the seemingly longer run; that turned out to be more about the filesystem behaviour. The actual speed-up is more like 7%.
I've repeated the above now that the filesystem is more stable, with the number of nodes reduced from 20 to 12. The message remains the same: the balance stays stable with fewer repartitions, which gives a wall-clock speed-up of around 7%.
Since @rgb asked for a binned version of the plot to make the variations more obvious, here is an attempt for the new runs.
This shows the median value per 100 steps, for a classic run with repartitioning every 100 steps in blue, and the new code in red. The points at which a new repartition was performed are the green crosses. That seems to show that the balance was worse until the second repartition (the first is a little hidden at step 3), but we are largely as good afterwards. The extra steps are the speed up.
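For anyone wanting to reproduce the binning from the raw data, a minimal sketch of a per-bin median is given here. It is not the script used for the plot; the function name `binned_medians` is hypothetical, and dropping a trailing incomplete bin is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Ordering function for qsort on doubles. */
static int compare_doubles(const void *a, const void *b) {
  const double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Hypothetical helper: write the median of each complete bin of
 * 'bin_size' consecutive samples into 'medians'. A trailing incomplete
 * bin is dropped (an assumption). Returns the number of bins written,
 * or -1 if the scratch buffer cannot be allocated. */
int binned_medians(const double *times, int nr_steps, int bin_size,
                   double *medians) {
  if (bin_size < 1 || nr_steps < bin_size) return 0;

  double *scratch = malloc(sizeof(double) * (size_t)bin_size);
  if (scratch == NULL) return -1;

  const int nr_bins = nr_steps / bin_size;
  for (int b = 0; b < nr_bins; b++) {
    /* Sort a copy so the input series is left untouched. */
    memcpy(scratch, &times[b * bin_size], sizeof(double) * (size_t)bin_size);
    qsort(scratch, (size_t)bin_size, sizeof(double), compare_doubles);

    if (bin_size % 2 == 1) {
      medians[b] = scratch[bin_size / 2];
    } else {
      medians[b] = 0.5 * (scratch[bin_size / 2 - 1] + scratch[bin_size / 2]);
    }
  }

  free(scratch);
  return nr_bins;
}
```

Calling this with `bin_size` set to 100 on the time-per-step series gives one value per 100 steps, as in the plot.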
Here are the raw data, just for completeness.
Added 51 commits:
- ceaaa6bf...a679332b - 49 commits from branch master
- d67758ac - Merge remote-tracking branch 'origin/master' into repartition-less
- bd803b48 - Extend possible schemes to include a number of steps as well as a
Added 1 commit:
- 5ddd0761 - Formatting
Added 84 commits:
- 5ddd0761...3b974424 - 83 commits from branch master
- 45d4b79d - Merge remote-tracking branch 'origin/master' into repartition-less
Added 1 commit:
- d76b4aa1 - Stop repartitioning in the step after a repartition, that makes no sense and is …
This looks ready to go now, so please have a look and re-assign to me or @matthieu for merging.
          e->forcerepart = 1;
        }
      }

    #ifdef SWIFT_DEBUG_TASKS
      /* Save the cputimes for analysis. */
      fprintf(e->file_cputimes, "%6d ", e->step);
      for (int k = 0; k < e->nr_nodes; k++) {
        fprintf(e->file_cputimes, " %14.7g", elapsed_cputimes[k]);
      }
      fprintf(e->file_cputimes, "\n");
      fflush(e->file_cputimes);
    #endif
      }
    }

Added 1 commit:
- fc3e4d3e - Move logic about whether to trigger a repartition into function and remove the
Added 1 commit:
- 52b2e419 - Formatting