Implement lock-free subcell splitting and other speed-ups.
Based on work in the zoom-master branch by Matthieu and Will.
Avoids locking the memory used for subcells during splitting by giving each threadpool thread its own pool of memory. In simple tests this speeds things up nicely, especially during step 0.
Also speeds up the engine by assigning the runner for each cell (the owner) more uniformly and at random, and by using more information about the task weights when scheduling.
An EAGLE_50 volume run on a single COSMA 8 node shows speed-ups of the order of 20% over the initial 128 steps. This is also faster than 8xMPI on a node, but the reasons for that are more nuanced and the gain may be smaller for a proper MPI run (since, for instance, the MPI limits on the fastest possible step and the communications within a step will return).
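For illustration, here is a minimal C sketch of the per-thread pool idea, using invented names (`cell_pool`, `cell_pool_get`) rather than the actual SWIFT structures: each threadpool thread only ever takes sub-cells from its own pool, so splitting needs no lock.

```c
/* Minimal sketch of the idea, not the actual SWIFT code: each threadpool
 * thread draws sub-cells from its own pre-allocated pool, so no lock is
 * taken while splitting. All names here are illustrative. */
#include <stddef.h>

struct cell {
  struct cell *progeny[8];
  /* ... particle pointers, counts, etc. ... */
};

struct cell_pool {
  struct cell *cells; /* Block of cells owned by one threadpool thread. */
  size_t next;        /* Next free slot; only the owning thread touches it. */
  size_t size;        /* Capacity of the block. */
};

/* One pool per threadpool thread, indexed by that thread's id. Set up once
 * before the threadpool map starts, so no locking is needed afterwards. */
static struct cell_pool *pools;

/* Grab a fresh sub-cell without any lock: pools[tid] is private to tid. */
static struct cell *cell_pool_get(const int tid) {
  struct cell_pool *p = &pools[tid];
  if (p->next == p->size) return NULL; /* Exhausted; a refill would go here. */
  return &p->cells[p->next++];
}

/* During recursive splitting the threadpool thread passes its own id down,
 * so all eight progeny come from that thread's pool. */
static void cell_split(struct cell *c, const int tid) {
  for (int k = 0; k < 8; k++) c->progeny[k] = cell_pool_get(tid);
  /* ... sort particles into the progeny and recurse where needed ... */
}
```

The only shared state in this sketch is the array of pools itself, which is set up once before the threadpool runs, so nothing needs synchronising on the hot path.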
Activity
assigned to @pdraper
Played around with this lock-less subcell memory allocation and it is clearly a good thing from a speed-up perspective. You need to think about the issue of the cell owner being the threadpool tid: that is OK when there are as many of these as runners, but there is probably a serious imbalance issue when that isn't true. Maybe I've missed something.
added 1 commit
- f7fd4dc5 - Don't use the cell owner to index the subcell memory, there is a conflict with...
I've implemented the subcell memory using the threadpool id and another variable, so the owner is now free to go back to normal service. This is just as fast as before.
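In other words (a hedged sketch reusing the invented pool names from above, not the real SWIFT call chain), the pool index is now the threadpool id carried down the splitting calls explicitly, rather than being read back out of the cell's owner field:

```c
/* Sketch only: decoupling the sub-cell pool index from the cell owner. */
struct cell {
  int owner; /* Runner id used for scheduling; no longer doubles as a pool index. */
  /* ... */
};

struct cell *cell_pool_get(int tid); /* Per-thread pool getter as sketched above. */

/* Before: the pool index came from the owner, tying the two roles together. */
struct cell *get_subcell_old(const struct cell *c) {
  return cell_pool_get(c->owner);
}

/* After: the threadpool id is passed explicitly, so c->owner is free for the
 * scheduler to set however it likes. */
struct cell *get_subcell_new(const int tid) {
  return cell_pool_get(tid);
}
```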
As a caveat, I put the normal owner code back and that was in fact slower, something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.
I have one more trick which I think I'll add here, and then this code is faster than the current 8xMPI on a single node. At least for my hydro/self-gravity/stars EAGLE_50 low-z test...
added 1 commit
- 81389e85 - Accumulate weights over all dependencies, not just the maximum weight, this...
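A hedged sketch of what that weighting change amounts to (invented structure and field names, not the actual scheduler code): when a task's scheduling weight is computed from the tasks it unlocks, sum their weights rather than keeping only the largest.

```c
/* Sketch only: propagate dependency weights by accumulation instead of max. */
struct task {
  float cost;            /* Intrinsic cost estimate of this task. */
  float weight;          /* Scheduling weight including its dependants. */
  int nr_unlocks;
  struct task **unlocks; /* Tasks that can only run after this one. */
};

void task_reweight(struct task *t) {
  float acc = 0.f;
  /* Old behaviour (sketch): keep only the heaviest dependant,
   *   if (t->unlocks[k]->weight > acc) acc = t->unlocks[k]->weight;
   * New behaviour (sketch): add them all up, so a task that unblocks many
   * others is prioritised accordingly. */
  for (int k = 0; k < t->nr_unlocks; k++) acc += t->unlocks[k]->weight;
  t->weight = t->cost + acc;
}
```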
Revisited:

> As a caveat, I put the normal owner code back and that was in fact slower, something I have also noticed in the other things I'm looking at. On inspection the selected owners were not very uniform, if that was the intention, and there is no obvious memory locality at work either, so we may as well just let the scheduler pick a runner. With that in mind we could also remove all the owner code.

and it isn't as simple as that. I recovered some of the owner behaviour by setting the cell owner to the runner chosen when scheduling, and that is also slower than doing nothing, at least when pinning of any sort is used, that is with or without memory interleaving. By slower we are looking at about 20s in 600 (no interleave) and 14s in 400 (interleave) for step 0. That must mean that, in some sense, pinning manages to put threads in a worse position with respect to accessing the cell data than not pinning. Not sure I understand that, but it does indicate that we should remove that code completely.
added 3 commits
- 81389e85...dd746319 - 2 commits from branch master
- 464084bc - Merge branch 'master' into lock-free-subcell-splitting
I like your way of getting a per-thread id better than mine. So we should keep that for sure.
Also, generally, the faster rebuild is a good thing. So we should keep that.
I'll need to think about the owner. I may be missing something but I'd think that having the thread/core that allocated the memory as owner would be better. Though I guess these threads are not allocating the particles...
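For reference, one common lock-free way of getting a stable per-thread id of the kind discussed above is a thread-local cache filled from an atomic counter on first use. This is only an illustration of the general technique, with invented names, and not necessarily the scheme adopted in the branch.

```c
#include <stdatomic.h>

/* Illustrative only: hand out indices 0, 1, 2, ... the first time each
 * thread asks, then remember the answer in thread-local storage. */
static atomic_int next_thread_index;
static _Thread_local int thread_index = -1;

int get_thread_index(void) {
  if (thread_index < 0)
    thread_index = atomic_fetch_add(&next_thread_index, 1);
  return thread_index;
}
```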
Yes, it must be something like that, but the picture is complicated when you also consider that the kernel may be moving memory between NUMA nodes and that the default layout is interleaved 4K pages per NUMA node. I looked at some breakdowns of the times in steps and tasks yesterday and didn't get a clear picture of what was important.
Edited by Peter W. Draper
@bvandenbroucke it may be interesting to revisit the question of how many ranks per node is best for colibre with these changes in.
- Resolved by Peter W. Draper
- Resolved by Peter W. Draper
added 1 commit
- 49233253 - Don't need a lock on space_recycle as only called from the threadpool, so...
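The commit message is cut off above, but the general pattern it alludes to can be sketched as follows (invented names, and only an interpretation): when recycling is only ever done by threadpool threads and each thread returns cells to its own list, the lock that would guard a shared free list can be dropped.

```c
#include <stddef.h>

/* Sketch only (invented names): with one recycling list per threadpool
 * thread, returning a cell needs no lock because thread tid is the only
 * writer of pools[tid]. */
struct cell;

struct cell_free_list {
  struct cell **recycled; /* Per-thread stack of cells awaiting reuse. */
  size_t count;
  size_t capacity;
};

void recycle_cell(struct cell_free_list *pools, const int tid,
                  struct cell *c) {
  struct cell_free_list *p = &pools[tid];
  if (p->count < p->capacity)
    p->recycled[p->count++] = c; /* No lock taken: list is thread-private. */
}
```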