WIP: Repartition using CPU ticks
Change repartitioning so that we only use CPU ticks as the weights.
In tests this seems to be sufficient and can give the best quality partitions, so reinstate the use of ticks to weight the graph that is partitioned by METIS.
The changes reintroduce measuring CPU ticks for all tasks by default (we keep --enable-task-debugging as that requires other fields), but remove any attempt to force repartitioning: we only repartition when enough information about the tasks is available, i.e. after steps that interact most particles.
This decouples task weights from repartitioning, as scheduling needs to be done with partial information; on that side we still have an issue with tasks that do not have characteristic cost functions that scale with particle counts.
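As a rough illustration of the idea (not the actual SWIFT code; the struct, array and function names below are invented), here is a minimal sketch of how measured per-task ticks could be accumulated into the vertex and edge weights of the cell graph that METIS partitions:

```c
/* Hypothetical sketch: turning per-task CPU ticks into METIS weights.
 * Structures and names are illustrative, not SWIFT's actual ones. */
#include <stdint.h>
#include <stdio.h>

#define NCELLS 4

/* A toy task: it touched one cell (self) or two cells (pair) and we
 * recorded how many CPU ticks it took last time it ran. */
struct toy_task {
  int ci, cj;     /* cell indices; cj < 0 for a self task */
  uint64_t ticks; /* measured cost of the task */
};

int main(void) {
  struct toy_task tasks[] = {
      {0, -1, 1200}, {1, -1, 800}, {0, 1, 400}, {2, 3, 950}, {3, -1, 300}};
  const int ntasks = sizeof(tasks) / sizeof(tasks[0]);

  /* Vertex weights: total ticks of work attached to each cell.
   * Edge weights: ticks of pair tasks coupling two cells. */
  uint64_t vwgt[NCELLS] = {0};
  uint64_t ewgt[NCELLS][NCELLS] = {{0}};

  for (int k = 0; k < ntasks; k++) {
    if (tasks[k].cj < 0) {
      vwgt[tasks[k].ci] += tasks[k].ticks;
    } else {
      /* Split a pair task's cost between its two cells and also use it to
       * weight the edge between them, so the partitioner tries to keep
       * tightly coupled cells on the same rank. */
      vwgt[tasks[k].ci] += tasks[k].ticks / 2;
      vwgt[tasks[k].cj] += tasks[k].ticks / 2;
      ewgt[tasks[k].ci][tasks[k].cj] += tasks[k].ticks;
      ewgt[tasks[k].cj][tasks[k].ci] += tasks[k].ticks;
    }
  }

  /* These arrays would be flattened into the vwgt/adjwgt arguments of
   * METIS_PartGraphKway() to produce the new cell-to-rank assignment. */
  for (int c = 0; c < NCELLS; c++)
    printf("cell %d: vertex weight %llu ticks\n", c,
           (unsigned long long)vwgt[c]);
  return 0;
}
```

The point is simply that the measured ticks, rather than any modelled cost, end up as the weights METIS balances.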
Activity
The competition to this technique is currently the repart-by-means branch (i.e. fixed costs), which also gives good partitions, but in tests of EAGLE_50 with hydro, self-gravity and stars this method is 7.5% more efficient. The test was to run with 8 ranks on a single node of COSMA7 for 48 hours. During that time this technique was 2% slower to reach the first proper repartition (at step 4125), but after that it only repartitioned twice, compared with 15 times, so it achieved better balance which was then maintained.
You can see this in the plot above, which shows simulation time against time spent in the step. We are slower to the first repartition, then generally better (pay attention to the upper values).
The optimal solution might be a hybrid of these approaches, but the fixed costs method needs tuning for the simulation, so in principle you need to get to the first repartition anyway.
@matthieu fancy giving this a try on one of your Millennium runs? With this and the recent changes to the initial partition we should look pretty good.
Just to be sure, is this using the per-task ticks, or the average ticks per task type over all ranks?
Using the raw ticks had the problem that these were not always available for all tasks, or has that problem been solved?
In any case, it would be awesome to go back to the "pure" solution of using ticks :)
Yes, we have gone backwards once more, and this is using per-task ticks.
We "solve" the problem of tick availability by waiting until they are available. This is at the same time we usually assess the balance anyway, so the question boils down to can we afford to wait that long.
So we give up the ability to force a repartition.
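A minimal sketch of what that waiting condition could look like (purely illustrative, the struct and function names are assumptions, not SWIFT's API):

```c
/* Minimal sketch of the "wait until ticks exist" check; names are
 * illustrative, not SWIFT's actual structures or API. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct task_info {
  uint64_t ticks; /* measured cost, 0 if the task has not run recently */
  int skipped;    /* task was not active since the last check */
};

/* Repartitioning is only worthwhile once every active task has a measured
 * cost, which in practice means waiting for a step that interacts most
 * particles. */
static int ticks_available(const struct task_info *tasks, size_t ntasks) {
  for (size_t k = 0; k < ntasks; k++)
    if (!tasks[k].skipped && tasks[k].ticks == 0) return 0;
  return 1;
}

int main(void) {
  struct task_info tasks[] = {{1200, 0}, {0, 1}, {640, 0}};
  if (ticks_available(tasks, 3))
    printf("enough data, repartition now\n");
  else
    printf("incomplete data, wait for a fuller step\n");
  return 0;
}
```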
I'm considering a hybrid approach. The things holding me back are necessity (do we really need it?), the extra complexity introduced, and the fact that I'd like master to have a better scheme now.
Complexity because it looks like we need to produce a tuned set of averages for each type of run. For instance, there is a lot of variation between, say, EAGLE_6 and EAGLE_25 (that looks like it comes down to the details of how the task splitting and recursion work). Then there are tasks that can do more than one job, e.g. the kicks do work for both self-gravity and hydro, not to mention all the different types of cooling I haven't even looked at...
I have code in SWIFT that splits out the averages, so you can run once to get these values, recompile, and then do the "faster" run. Phew.
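For illustration only, and using made-up names (dump_fixed_costs, task_type_name, repart_fixed_costs) rather than the real SWIFT ones, splitting out the per-task-type averages and writing them to a header for the next build could look something like this:

```c
/* Illustration only: compute per-task-type mean ticks and write them to a
 * header so a later build can compile them in as fixed costs. */
#include <stdint.h>
#include <stdio.h>

#define NTYPES 3
static const char *task_type_name[NTYPES] = {"sort", "self/density", "kick"};

/* Write mean ticks per task type in a form that can be compiled in. */
static void dump_fixed_costs(const uint64_t *sum_ticks, const uint64_t *count,
                             const char *filename) {
  FILE *f = fopen(filename, "w");
  if (f == NULL) return;
  fprintf(f, "/* Generated fixed costs: mean ticks per task type. */\n");
  fprintf(f, "static const double repart_fixed_costs[%d] = {\n", NTYPES);
  for (int t = 0; t < NTYPES; t++) {
    const double mean =
        count[t] > 0 ? (double)sum_ticks[t] / (double)count[t] : 0.0;
    fprintf(f, "  %.1f, /* %s */\n", mean, task_type_name[t]);
  }
  fprintf(f, "};\n");
  fclose(f);
}

int main(void) {
  /* Totals accumulated over a tuning run (invented numbers). */
  const uint64_t sum_ticks[NTYPES] = {512000, 8802100, 131000};
  const uint64_t count[NTYPES] = {100, 100, 100};
  dump_fixed_costs(sum_ticks, count, "partition_fixed_costs.h");
  return 0;
}
```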
So as the ultimate solution we now have the ability to mix fixed_costs and ticks. So that is three options (sketched in code after the list):
- just ticks
- just fixed costs
- fixed costs then ticks
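A sketch of how that selection could look, using an illustrative enum and function rather than the actual SWIFT options:

```c
/* Illustrative sketch of the three weighting modes; the enum and names are
 * invented, not the real SWIFT parameters. */
#include <stdint.h>
#include <stdio.h>

enum repart_weights { WEIGHTS_TICKS, WEIGHTS_FIXED, WEIGHTS_FIXED_THEN_TICKS };

/* Weight for one task given its measured ticks (0 if not yet measured)
 * and a pre-tuned fixed cost for its task type. */
static double task_weight(enum repart_weights mode, uint64_t ticks,
                          double fixed_cost) {
  switch (mode) {
    case WEIGHTS_TICKS:
      return (double)ticks;
    case WEIGHTS_FIXED:
      return fixed_cost;
    case WEIGHTS_FIXED_THEN_TICKS:
      /* Fall back to the fixed cost until a real measurement exists. */
      return ticks > 0 ? (double)ticks : fixed_cost;
  }
  return 0.0;
}

int main(void) {
  /* A task that has not run yet: the hybrid mode uses the tuned cost. */
  printf("%.0f\n", task_weight(WEIGHTS_FIXED_THEN_TICKS, 0, 2500.0));
  /* The same task once it has measured ticks: the hybrid mode uses them. */
  printf("%.0f\n", task_weight(WEIGHTS_FIXED_THEN_TICKS, 3100, 2500.0));
  return 0;
}
```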
The signal we are looking for is hard to find, so I've needed to go to extreme measures and run 20 jobs of each type for 5000 steps, so we get just past the first non-forced repartition (at step 4126), and then take the mean of these. That gives us:

So you can see that good fixed costs can do better up to this non-forced repartition!
The fixed costs themselves are now output at each repartition as a file partition_fixed_costs.h, which is the name of the dummy version in the src/ directory. So in principle these can just be copied into place and used in the next code build, but as I have said before, these are marginal gains.
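Purely as an illustration of the shape such a generated header might take (the array name and the numbers here are invented, not the real contents of partition_fixed_costs.h):

```c
/* Hypothetical shape of the generated header once copied over the dummy
 * src/partition_fixed_costs.h; real field names and values will differ. */
#ifndef PARTITION_FIXED_COSTS_H
#define PARTITION_FIXED_COSTS_H

/* Mean ticks per task type measured in the tuning run. A zero entry would
 * mean "no data, fall back to a default cost". */
static const double repart_fixed_costs[] = {
    5123.0,  /* sort */
    88021.0, /* self/density */
    1310.0,  /* kick */
};

#endif /* PARTITION_FIXED_COSTS_H */
```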
An interesting plot is this one: it shows the ratio of minimum time to maximum time over the 20 runs per step, coloured by how long the step took. Quite a bit of variation, especially for the short steps.
That isn't MPI, as all these jobs ran on a single node of COSMA7, just with 8 ranks, so it must be down to the partitioning and how that impacts the workload.
mentioned in merge request !707 (merged)