
WIP: Repartition using CPU ticks

Closed Peter W. Draper requested to merge repart-by-ticks into master

Change repartitioning so that we only use CPU ticks as the weights.

In tests this seems to be sufficient and gives the best-quality partitions, so we reinstate the use of ticks to weight the graph that METIS partitions.

The changes reintroduce measuring CPU ticks for all tasks by default (we keep --enable-task-debugging, as that requires other fields), but remove any attempt to force repartitioning: we only repartition when enough information about the tasks is available, i.e. after steps that interact most particles.

This decouples task weights from repartitioning, since scheduling needs to be done with partial information; for scheduling we still have an issue with tasks whose characteristic functions do not scale with particle counts.
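As a concrete illustration of the scheme, here is a minimal sketch (not SWIFT's actual code): sum the measured ticks of each task onto the cells it touches, then pass those sums to METIS as vertex weights. The struct layout, the function name, and the graph arrays (xadj, adjncy) are assumptions for illustration; METIS_PartGraphKway is the real METIS entry point.

```c
#include <stdlib.h>
#include <metis.h>

/* Minimal stand-in for a task: tick counters recorded around its
 * execution, plus the one or two cells it acts on. */
struct task {
  long long tic, toc; /* CPU ticks at start/end of the task. */
  int ci, cj;         /* Cell indices; cj < 0 for single-cell tasks. */
};

/* Weight the top-level cell graph by per-task ticks and partition it
 * over the ranks. xadj/adjncy is the usual CSR graph; celllist gets
 * the rank assignment of each cell. */
void repart_by_ticks(const struct task *tasks, int nr_tasks,
                     idx_t nr_cells, idx_t *xadj, idx_t *adjncy,
                     idx_t nr_ranks, idx_t *celllist) {
  idx_t *vwgt = calloc(nr_cells, sizeof(idx_t));

  /* Vertex weight of a cell = sum of ticks of the tasks that touch
   * it, split evenly over the pair for two-cell tasks. */
  for (int k = 0; k < nr_tasks; k++) {
    const long long cost = tasks[k].toc - tasks[k].tic;
    if (tasks[k].cj < 0) {
      vwgt[tasks[k].ci] += (idx_t)cost;
    } else {
      vwgt[tasks[k].ci] += (idx_t)(cost / 2);
      vwgt[tasks[k].cj] += (idx_t)(cost / 2);
    }
  }

  /* Hand the tick-weighted graph to METIS. */
  idx_t ncon = 1, objval;
  METIS_PartGraphKway(&nr_cells, &ncon, xadj, adjncy, vwgt,
                      /*vsize=*/NULL, /*adjwgt=*/NULL, &nr_ranks,
                      /*tpwgts=*/NULL, /*ubvec=*/NULL, /*options=*/NULL,
                      &objval, celllist);
  free(vwgt);
}
```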

Edited by Peter W. Draper

Activity

  • The competition to this technique is currently the repart-by-means branch (i.e. fixed costs), which also gives good partitions, but in tests of EAGLE_50, with hydro, self-gravity and stars, this method is 7.5% more efficient.

    The test was to run with 8 ranks on a single node of COSMA7 for 48 hours. During that time this technique was 2% slower to reach the first proper repartition (at step 4125), but after that it repartitioned only twice, compared with 15 times, so it achieved better balance, which was then maintained.

    [Plot: tick-v-fixed-costs]

    You can see this in the plot above, which shows simulation time against time spent in the step. We are slower to the first repartition, then generally better (pay attention to the upper values).

    The optimal solution might be a hybrid of these approaches, but the fixed costs method needs tuning for the simulation, so in principle you need to get to the first repartition anyway.

  • @matthieu fancy giving this a try on one of your Millennium runs? With this and the recent changes to the initial partition we should look pretty good.

  • Peter W. Draper changed title from WIP: Repart by ticks to WIP: Repartition using CPU ticks

  • Peter W. Draper changed the description

  • Is there anything you would like to plot or some numbers you would be interested in?

  • Nothing much, just capture the log and timestep file so we can check what the balance looks like, particularly up to the first repartition.

  • Just to be sure, is this using the per-task ticks, or the average ticks per task type over all ranks?

    Using the raw ticks had the problem that these were not always available for all tasks, or has that problem been solved?

    In any case, it would be awesome to go back to the "pure" solution of using ticks :)

  • Yes, we have gone backwards once more, and this is using per-task ticks.

    We "solve" the problem of tick availability by waiting until they are available. This is at the same time we usually assess the balance anyway, so the question boils down to can we afford to wait that long.

    So we give up the ability to force a repartition.
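    For illustration, the gate could be as simple as the sketch below; the names and the 99% threshold are assumptions, not the actual implementation:

    ```c
    /* Only repartition (and trust the tick data) after a step that
     * updated essentially all particles; names and the threshold
     * are illustrative. */
    static int repart_allowed(long long updated_this_step,
                              long long total_particles) {
      return updated_this_step >= (long long)(0.99 * (double)total_particles);
    }
    ```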

  • Well... In that case we could definitely default to the averaging scheme if we don't have the ticks for all tasks :)
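    A hypothetical shape for that fallback (the field and variable names are invented for illustration):

    ```c
    struct task {
      int type;           /* Task type index. */
      long long tic, toc; /* Tick counters recorded around execution. */
    };

    /* Prefer raw per-task ticks; if they are not yet available for all
     * tasks, fall back to the mean ticks per task type gathered over
     * all ranks. Purely a sketch of the suggestion above. */
    static long long task_weight(const struct task *t,
                                 const long long *mean_ticks_by_type,
                                 int have_all_ticks) {
      const long long ticks = t->toc - t->tic;
      return (have_all_ticks && ticks > 0) ? ticks
                                           : mean_ticks_by_type[t->type];
    }
    ```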

  • I'm considering a hybrid approach. The things holding me back are necessity (do we really need it?), the extra complexity introduced, and that I'd like master to have a better scheme now.

    Complexity because it looks like we need to produce a tuned set of averages for each type of run. For instance, there is a lot of variation between, say, EAGLE_6 and EAGLE_25 (that looks like it is down to the details of how the task splitting and recursion work). Then we have the fact that some tasks do more than one job, e.g. the kicks do work for self-gravity and hydro, not to mention all the different types of cooling I haven't even looked at...

    I have code in SWIFT that splits out the averages, so you can run once to get these values, recompile, and then do the "faster" run. Phew.

  • The good news (or bad news here) is that the Millennium run does not repartition, as it never reaches the threshold. I can lower the threshold to force repartitions more often. Would that be useful?

  • Is this using 0.1? Anyway, it sounds like we need to force the matter, so why not drop down to 0.01.

  • Yes, 0.1. I'll drop it so we get some numbers.

  • So as the ultimate solution we now have the ability to mix fixed_costs and ticks. That gives three options (a sketch of the selection logic follows the list):

    • just ticks
    • just fixed costs
    • fixed costs then ticks
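    A minimal sketch of how the three options could be selected; the enum, the array, and all names are hypothetical, not SWIFT's API:

    ```c
    struct task {
      int type;           /* Task type index. */
      long long tic, toc; /* Tick counters recorded around execution. */
    };

    enum repart_mode {
      REPART_TICKS,           /* Just ticks. */
      REPART_FIXED_COSTS,     /* Just fixed costs. */
      REPART_FIXED_THEN_TICKS /* Fixed costs then ticks. */
    };

    extern const long long fixed_cost_by_type[]; /* Tuned per task type. */

    static long long task_cost(const struct task *t, enum repart_mode mode,
                               int ticks_available) {
      switch (mode) {
        case REPART_TICKS:
          return t->toc - t->tic;
        case REPART_FIXED_COSTS:
          return fixed_cost_by_type[t->type];
        case REPART_FIXED_THEN_TICKS:
          /* Tuned fixed costs until measured ticks are complete. */
          return ticks_available ? t->toc - t->tic
                                 : fixed_cost_by_type[t->type];
      }
      return 0;
    }
    ```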

    The signal we are looking for is hard to find, so I've had to go to extreme measures: run 20 jobs of each type for 5000 steps, so that we get just past the first non-forced repartition (at step 4126), and then take the mean of these. That gives us:

    [Plot: three-schemes]

    So you can see that good fixed costs can do better up to this non-forced repartition!

    The fixed costs themselves are now output at each repartition as a file partition_fixed_costs.h, which is the name of the dummy version in the src/ directory. So in principle these can just be copied into place and used in the next code build, but, as I have said before, these are marginal gains.
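    To illustrate the mechanism only (the real generated file will differ), such a header might look like the sketch below; the task names and values are invented:

    ```c
    /* Hypothetical sketch of a generated partition_fixed_costs.h. */
    #ifndef SWIFT_PARTITION_FIXED_COSTS_H
    #define SWIFT_PARTITION_FIXED_COSTS_H

    static const long long repartition_costs[] = {
        /* sort */ 412,  /* invented value */
        /* self */ 1378, /* invented value */
        /* pair */ 935,  /* invented value */
        /* ... one entry per task type ... */
    };

    #endif /* SWIFT_PARTITION_FIXED_COSTS_H */
    ```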

    An interesting plot is this one:

    [Plot: variation]

    This shows the ratio of minimum time to maximum time across the 20 runs per step, coloured by how long the step took. There is quite a bit of variation, especially for the short steps.

    That isn't MPI, as all these jobs ran on a single node of COSMA7, just with 8 ranks, so it must be down to the partitioning and how that impacts the workload.

  • Peter W. Draper mentioned in merge request !707 (merged)
