Implement lock-free subcell splitting and other speed-ups.

Merged Peter W. Draper requested to merge lock-free-subcell-splitting into master

Based on work in the zoom-master branch by Matthieu and Will

Avoids locking the memory used for subcells during splitting by having a pool of memory for each threadpool thread. In simple tests this speeds things up nicely, especially during step 0.
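
To make the idea concrete, here is a minimal sketch of per-thread sub-cell pools, using illustrative names (cell_slab, slabs_init, cell_get) rather than SWIFT's actual API:

```c
#include <stdlib.h>

/* Illustrative stand-in for the real cell structure. */
struct cell {
  struct cell *progeny[8];
};

/* One slab of pre-allocated cells per threadpool thread. A slab is only ever
 * touched by its owning thread, so splitting can grab sub-cells without
 * taking a lock on a shared free list. */
struct cell_slab {
  struct cell *cells; /* memory reserved for this thread */
  int next;           /* next unused entry */
  int count;          /* number of entries in the slab */
};

static struct cell_slab *slabs;

/* Reserve the per-thread slabs up front, e.g. ahead of a rebuild. */
void slabs_init(int num_threads, int cells_per_thread) {
  slabs = malloc(num_threads * sizeof(*slabs));
  for (int t = 0; t < num_threads; t++) {
    slabs[t].cells = malloc(cells_per_thread * sizeof(struct cell));
    slabs[t].next = 0;
    slabs[t].count = cells_per_thread;
  }
}

/* Lock-free: only threadpool thread `tid` ever touches slabs[tid]. */
struct cell *cell_get(int tid) {
  struct cell_slab *s = &slabs[tid];
  if (s->next == s->count) return NULL; /* slab exhausted; would refill here */
  return &s->cells[s->next++];
}
```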

Also speeds up the engine by assigning a more uniform, randomly chosen runner to each cell (the owner) and by using more information about the task weights when scheduling.

An EAGLE_50 volume run on a single COSMA 8 node shows speed-ups of the order of 20% over the initial 128 steps. This is also faster than 8xMPI on a node, but the reasons for that are more nuanced and the gain may be smaller for a proper MPI run (since, for instance, the MPI limits on the fastest possible step and the communications within a step will return).

Edited by Peter W. Draper

Activity

  • assigned to @pdraper

  • Peter W. Draper changed the description

  • @matthieu @dc-rope1

    Played around with this lock-less subcell memory allocation and it is clearly a good thing from a speed-up perspective. You need to think about the issue of the cell owner being the threadpool tid: that is OK when these are equal in number to the runners, but there is probably a serious imbalance issue when that isn't true, unless I've missed something.

  • Great that it shows an improvement in your test too.

    Regarding the owners, we could try to set the ownership as in master. This way we only have the benefit of the parallel rebuild and can disentangle the two changes.

  • added 1 commit

    • f7fd4dc5 - Don't use the cell owner to index the subcell memory, there is a conflict with...

    I've implemented the subcell memory using the threadpool id and another variable (see the sketch below), so the owner is now free to go back to normal service. This is just as fast as before.

    As a caveat, I put the normal owner code back and that was in fact slower; this is something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.

    I have one more trick which I think I'll add here, and then this code is faster than the current 8xMPI on a single node, at least for my hydro/self-gravity/stars EAGLE_50 low-z test...
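
    For illustration, a stable per-thread index of the kind used to select the subcell memory can be obtained from an atomic counter cached in thread-local storage; this is a sketch of the general technique, not necessarily what this branch does:

```c
#include <stdatomic.h>

/* Sketch only: hand each threadpool thread a dense, stable index the first
 * time it asks, by claiming the next slot from an atomic counter and caching
 * it in thread-local storage. That index can then pick the thread's private
 * sub-cell slab without involving the cell owner at all. */
static atomic_int threadpool_next_tid = 0;
static _Thread_local int threadpool_my_tid = -1;

int threadpool_gettid(void) {
  if (threadpool_my_tid < 0)
    threadpool_my_tid = atomic_fetch_add(&threadpool_next_tid, 1);
  return threadpool_my_tid;
}
```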

  • added 1 commit

    • 81389e85 - Accumulate weights over all dependencies, not just the maximum weight, this...

  • There it is. The reweighting uses more information about the dependencies, which gives the engine a boost (sketched below).

    Would be interesting to see if you also see this.
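
    Schematically, the reweighting change amounts to summing the weights of everything a task unlocks rather than taking only the largest one; the sketch below is hedged (the task fields and traversal order are assumptions, not the scheduler's exact code):

```c
/* Sketch of accumulating weights over all dependencies: a task's scheduling
 * weight becomes its own cost plus the accumulated cost of every task it
 * unlocks, instead of its own cost plus only the heaviest one, so tasks that
 * release a lot of downstream work are favoured earlier. */
struct task {
  float cost;                 /* this task's own cost estimate */
  float weight;               /* cost including downstream work */
  int nr_unlock_tasks;        /* number of tasks this one unlocks */
  struct task **unlock_tasks; /* the tasks it unlocks */
};

/* Assumed to be called in reverse dependency order, so every task in
 * unlock_tasks already has its weight set. */
void task_reweight(struct task *t) {
  float acc = 0.f;
  for (int k = 0; k < t->nr_unlock_tasks; k++)
    acc += t->unlock_tasks[k]->weight; /* sum over all, not just the max */
  t->weight = t->cost + acc;
}
```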

  • BTW, this does require --interleave as well.

  • Revisited:

    As a caveat, I put the normal owner code back and that was in fact slower; this is something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.

    and it isn't as simple as that. I recovered some of the owner behaviour by setting the cell owner to the runner chosen when scheduling, and that is also slower than doing nothing, at least when pinning of any sort is used, that is with or without memory interleaving. By slower we are looking at 20s in 600 (no interleave) and 14s in 400 (interleave) for step 0. That must mean that in some sense pinning manages to put threads in a worse position with respect to accessing the cell data than not pinning. I'm not sure I understand that, but it does indicate that we should remove that code completely.

  • Peter W. Draper added 3 commits

  • I like your way of getting a per-thread id better than mine. So we should keep that for sure.

    Also, generally, the faster rebuild is a good thing. So we should keep that.

    I'll need to think about the owner. I may be missing something, but I'd have thought that having the thread/core that allocated the memory as the owner would be better. Though I guess these threads are not allocating the particles...

    Yes, it must be something like that, but the picture is complicated when you also consider that the kernel may be moving memory between NUMA nodes and that the default layout is interleaved 4K pages per NUMA node. I looked at some breakdowns of the times in steps and tasks yesterday and didn't get a clear picture of what was important.

    Edited by Peter W. Draper
  • We have no issues here if the number of runner threads and the number of pool threads are different, right?

  • Yes, this version doesn't have that issue.

  • Peter W. Draper changed the description

  • @bvandenbroucke it may be interesting to revisit the question of how many ranks per node is best for COLIBRE with these changes in.

  • added 1 commit

    • 49233253 - Don't need a lock on space_recycle as only called from the threadpool, so...
