Implement lock-free subcell splitting and other speed-ups.
Based on work in the zoom-master branch by Matthieu and Will.
Avoids locking the memory used for subcells during splitting by giving each threadpool thread its own pool of memory. In simple tests this speeds things up nicely, especially during step 0.
Also speeds up the engine by assigning the runner for each cell (the owner) more uniformly and at random, and by using more information about the task weights when scheduling.
An EAGLE_50 volume run on a single COSMA 8 node shows speed-ups of the order of 20% over the initial 128 steps. This is also faster than 8xMPI on a node, but the reasons for that are more nuanced and the gain may be smaller for a proper MPI run (since, for instance, the MPI limits on the fastest possible step and the communications within a step will return).
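For illustration, here is a minimal C sketch of the per-thread pool idea, using invented names (`cell_pool`, `cell_pool_get`) rather than the actual SWIFT structures: each threadpool thread only ever takes sub-cells from its own pool, so splitting needs no lock.

```c
/* Minimal sketch of the idea, not the actual SWIFT code: each threadpool
 * thread draws sub-cells from its own pre-allocated pool, so no lock is
 * taken while splitting. All names here are illustrative. */
#include <stddef.h>

struct cell {
  struct cell *progeny[8];
  /* ... particle pointers, counts, etc. ... */
};

struct cell_pool {
  struct cell *cells; /* Block of cells owned by one threadpool thread. */
  size_t next;        /* Next free slot; only the owning thread touches it. */
  size_t size;        /* Capacity of the block. */
};

/* One pool per threadpool thread, indexed by that thread's id. Set up once
 * before the threadpool map starts, so no locking is needed afterwards. */
static struct cell_pool *pools;

/* Grab a fresh sub-cell without any lock: pools[tid] is private to tid. */
static struct cell *cell_pool_get(const int tid) {
  struct cell_pool *p = &pools[tid];
  if (p->next == p->size) return NULL; /* Exhausted; a refill would go here. */
  return &p->cells[p->next++];
}

/* During recursive splitting the threadpool thread passes its own id down,
 * so all eight progeny come from that thread's pool. */
static void cell_split(struct cell *c, const int tid) {
  for (int k = 0; k < 8; k++) c->progeny[k] = cell_pool_get(tid);
  /* ... sort particles into the progeny and recurse where needed ... */
}
```

The only shared state in this sketch is the array of pools itself, which is set up once before the threadpool runs, so nothing needs synchronising on the hot path.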
Activity
assigned to @pdraper
Played around with this lock-less subcell memory allocation and it is clearly a good thing from a speed-up perspective. You need to think about the issue of the cell owner being the threadpool tid: that is OK when there are as many of these as runners, but there is probably a serious imbalance issue when that isn't true. Maybe I've missed something.
added 1 commit
- f7fd4dc5 - Don't use the cell owner to index the subcell memory, there is a conflict with...
I've implemented the subcell memory using the threadpool id and another variable, so the owner is now free to go back to normal service. This is just as fast as before.
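In other words (a hedged sketch reusing the invented pool names from above, not the real SWIFT call chain), the pool index is now the threadpool id carried down the splitting calls explicitly, rather than being read back out of the cell's owner field:

```c
/* Sketch only: decoupling the sub-cell pool index from the cell owner. */
struct cell {
  int owner; /* Runner id used for scheduling; no longer doubles as a pool index. */
  /* ... */
};

struct cell *cell_pool_get(int tid); /* Per-thread pool getter as sketched above. */

/* Before: the pool index came from the owner, tying the two roles together. */
struct cell *get_subcell_old(const struct cell *c) {
  return cell_pool_get(c->owner);
}

/* After: the threadpool id is passed explicitly, so c->owner is free for the
 * scheduler to set however it likes. */
struct cell *get_subcell_new(const int tid) {
  return cell_pool_get(tid);
}
```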
As a caveat, I put the normal owner code back and that was in fact slower, something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.
I have one more trick which I think I'll add here, and then this code is faster than the current 8xMPI on a single node. At least for my hydro/self-gravity/stars EAGLE_50 low-z test...
added 1 commit
- 81389e85 - Accumulate weights over all dependencies, not just the maximum weight, this...
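A hedged sketch of what that weighting change amounts to (invented structure and field names, not the actual scheduler code): when a task's scheduling weight is computed from the tasks it unlocks, sum their weights rather than keeping only the largest.

```c
/* Sketch only: propagate dependency weights by accumulation instead of max. */
struct task {
  float cost;            /* Intrinsic cost estimate of this task. */
  float weight;          /* Scheduling weight including its dependants. */
  int nr_unlocks;
  struct task **unlocks; /* Tasks that can only run after this one. */
};

void task_reweight(struct task *t) {
  float acc = 0.f;
  /* Old behaviour (sketch): keep only the heaviest dependant,
   *   if (t->unlocks[k]->weight > acc) acc = t->unlocks[k]->weight;
   * New behaviour (sketch): add them all up, so a task that unblocks many
   * others is prioritised accordingly. */
  for (int k = 0; k < t->nr_unlocks; k++) acc += t->unlocks[k]->weight;
  t->weight = t->cost + acc;
}
```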
Revisited:

> As a caveat, I put the normal owner code back and that was in fact slower, something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.

and it isn't as simple as that. I recovered some of the owner behaviour by setting the cell owner to the runner chosen when scheduling, and that is also slower than doing nothing, at least when pinning of any sort is used, that is with or without memory interleaving. By slower we are looking at about 20s in 600 (no interleave) and 14s in 400 (interleave) for step 0. That must mean that, in some sense, pinning manages to put threads in a worse position with respect to accessing the cell data than not pinning. Not sure I understand that, but it does indicate that we should remove that code completely.
added 3 commits
- 81389e85...dd746319 - 2 commits from branch master
- 464084bc - Merge branch 'master' into lock-free-subcell-splitting
I like your way of getting a per-thread id better than mine. So we should keep that for sure.
Also, generally, the faster rebuild is a good thing. So we should keep that.
I'll need to think about the owner. I may be missing something but I'd think that having the thread/core that allocated the memory as owner would be better. Though I guess these threads are not allocating the particles...
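For reference, one common lock-free way of getting a stable per-thread id of the kind discussed above is a thread-local cache filled from an atomic counter on first use. This is only an illustration of the general technique, with invented names, and not necessarily the scheme adopted in the branch.

```c
#include <stdatomic.h>

/* Illustrative only: hand out indices 0, 1, 2, ... the first time each
 * thread asks, then remember the answer in thread-local storage. */
static atomic_int next_thread_index;
static _Thread_local int thread_index = -1;

int get_thread_index(void) {
  if (thread_index < 0)
    thread_index = atomic_fetch_add(&next_thread_index, 1);
  return thread_index;
}
```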
Yes, it must be something like that, but the picture is complicated when you also consider that the kernel may be moving memory between NUMA nodes and that the default layout is interleaved 4K pages per NUMA node. I looked at some breakdowns of the times in steps and tasks yesterday and didn't get a clear picture of what was important.
Edited by Peter W. Draper
@bvandenbroucke it may be interesting to revisit the question of how many ranks per node is best for colibre with these changes in.
- Resolved by Peter W. Draper
- Resolved by Peter W. Draper
added 1 commit
- 49233253 - Don't need a lock on space_recycle as only called from the threadpool, so...
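The commit message is cut off above, but the general pattern it alludes to can be sketched as follows (invented names, and only an interpretation): when recycling is only ever done by threadpool threads and each thread returns cells to its own list, the lock that would guard a shared free list can be dropped.

```c
#include <stddef.h>

/* Sketch only (invented names): with one recycling list per threadpool
 * thread, returning a cell needs no lock because thread tid is the only
 * writer of pools[tid]. */
struct cell;

struct cell_free_list {
  struct cell **recycled; /* Per-thread stack of cells awaiting reuse. */
  size_t count;
  size_t capacity;
};

void recycle_cell(struct cell_free_list *pools, const int tid,
                  struct cell *c) {
  struct cell_free_list *p = &pools[tid];
  if (p->count < p->capacity)
    p->recycled[p->count++] = c; /* No lock taken: list is thread-private. */
}
```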