Draft: Implement a memcpy() clone that uses the threadpool

assigned to @pdraper

added 6 commits

6de3e3ad...e876eb0f - 5 commits from branch master
b037e0de - Merge remote-tracking branch 'origin/master' into threadpool_memcpy

@matthieu @dc-rope1 assuming this worked (!), step 28 dropped from 67s to 38s when I use it to replace the memcpy()s in engine_split_particles(). That is still twice the time taken by that MPI rank.
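For anyone following along, the idea can be sketched roughly as below. This is not the code in this branch: it uses plain pthreads as a stand-in for the SWIFT threadpool, and `parallel_memcpy`, `copy_job`, `copy_chunk` and `MAX_COPY_THREADS` are hypothetical names made up for the illustration. The copy is split into one contiguous chunk per thread and each thread runs an ordinary memcpy() on its chunk.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_COPY_THREADS 128 /* hypothetical upper bound for this sketch */

/* Hypothetical per-thread job description; not a SWIFT type. */
struct copy_job {
  char *dest;
  const char *src;
  size_t nbytes;
};

/* Thread entry point: copy one contiguous chunk. */
static void *copy_chunk(void *arg) {
  struct copy_job *job = (struct copy_job *)arg;
  memcpy(job->dest, job->src, job->nbytes);
  return NULL;
}

/* Copy n bytes from src to dest using up to nthreads threads.
 * The regions must not overlap, as for memcpy(). Returns dest. */
void *parallel_memcpy(void *dest, const void *src, size_t n, int nthreads) {
  assert(nthreads >= 1);
  if (nthreads > MAX_COPY_THREADS) nthreads = MAX_COPY_THREADS;

  /* Tiny copies or a single thread: fall back to a plain memcpy(). */
  if (nthreads < 2 || n < (size_t)nthreads) return memcpy(dest, src, n);

  pthread_t threads[MAX_COPY_THREADS];
  struct copy_job jobs[MAX_COPY_THREADS];
  int started[MAX_COPY_THREADS];
  const size_t chunk = n / (size_t)nthreads;

  for (int i = 0; i < nthreads; i++) {
    const size_t offset = (size_t)i * chunk;
    jobs[i].dest = (char *)dest + offset;
    jobs[i].src = (const char *)src + offset;
    /* The last chunk also takes the remainder. */
    jobs[i].nbytes = (i == nthreads - 1) ? n - offset : chunk;

    /* If a thread cannot be created, copy that chunk in the caller. */
    started[i] = (pthread_create(&threads[i], NULL, copy_chunk, &jobs[i]) == 0);
    if (!started[i]) copy_chunk(&jobs[i]);
  }

  for (int i = 0; i < nthreads; i++) {
    if (started[i]) pthread_join(threads[i], NULL);
  }
  return dest;
}
```

A real version would hand the chunks to the already running pool threads rather than creating new threads on every call, which matters a lot for the smaller copies.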

Wow!

That memcpy can also be useful elsewhere I reckon.

Tried it out on other parts of that section, but it didn't help. It turns out the biggest remaining non-threadpool part of that step is the space particle sorting; the others are the re-weights and ranking, so I guess that is as good as it gets.

While at it, can we make use of alignment information? Some memcpy() implementations are faster when working on aligned blocks.
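One possible way to do that, sketched under assumptions: round the per-thread chunk size up to a multiple of some alignment, so that if the source and destination buffers themselves start on that boundary, every thread's chunk does too. The helper names `aligned_chunk_size` and `chunk_length` are hypothetical, and the choice of alignment (for instance a 64-byte cache line) is an assumption that would need testing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: per-thread chunk size in bytes for a copy of n bytes,
 * rounded up to a multiple of align. align must be a power of two. */
size_t aligned_chunk_size(size_t n, int nthreads, size_t align) {
  assert(nthreads > 0);
  assert(align > 0 && (align & (align - 1)) == 0);
  /* Ceiling division so that nthreads chunks always cover n bytes. */
  const size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;
  /* Round up to the next multiple of align. */
  return (chunk + align - 1) & ~(align - 1);
}

/* Hypothetical helper: number of bytes thread tid copies, starting at
 * offset tid * chunk. Returns zero for threads that start past the end. */
size_t chunk_length(size_t n, size_t chunk, int tid) {
  const size_t start = (size_t)tid * chunk;
  if (start >= n) return 0;
  return (n - start < chunk) ? n - start : chunk;
}
```

With this split, only the last active thread copies a length that is not a multiple of the alignment, and some trailing threads may get nothing to do when the copy is small.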

The memcpy() in space_allocate_extras() would also benefit from this. And possibly also in space_rebuild() after the exchange of strays.

added 11 commits

b037e0de...c2b024a3 - 9 commits from branch master
3361db51 - Add threadpool mappers for scheduler_reweight
d35e06f8 - Merge remote-tracking branch 'origin/master' into threadpool_memcpy

added 1 commit

5684af6c - Revert "Add threadpool mappers for scheduler_reweight"

added 1 commit

35ee0ef3 - Need to protect against owner for cj also being -1

added 1 commit

f6353070 - Issues found by sanitizer

@dc-rope1 you might be interested in this as well.

Tried this out in various parts of SWIFT with very mixed results, so I wanted to see if any obvious heuristics could be used to work out when it is worth using. To that end I wrote a simple program that just does the memcpy using various numbers of threads and data sizes (the data is spread over all the threads, as is typical for SWIFT, rather than copying a fixed size per thread). Here are some interesting plots that show how these two variables work out on COSMA8 (AMD, 128 cores):

What we are seeing here is the data copied (logged) in GB versus the data rate of the copy. Each discrete size is copied a number of times, using from 1 to 128 threads (in steps of 2). The number of threads used sets both the colour scale and the size of the marker. Some scaled markers are too small to be seen, so they have a fixed-size green marker as well.

So we can see that it is not worth using a threaded copy until we get to quite large sizes, and we should only use all 128 threads for sizes greater than 1GB. Below that, fewer threads are always better.
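As a rough sketch of what such a heuristic could look like in code: the 1GB threshold for using all threads is read off the plots, but the lower cut-off (64MB here) and the linear ramp in between are placeholders I have not measured, and `copy_threads_for_size` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* At or above this size use all the threads (1GB, from the COSMA8 plots). */
#define COPY_ALL_THREADS_BYTES ((size_t)1 << 30)
/* Below this size use one thread. HYPOTHETICAL value (64MB), needs tuning. */
#define COPY_MIN_THREADED_BYTES ((size_t)1 << 26)

/* Pick the number of threads for a copy of nbytes, given max_threads. */
int copy_threads_for_size(size_t nbytes, int max_threads) {
  assert(max_threads >= 1);
  if (max_threads < 2 || nbytes < COPY_MIN_THREADED_BYTES) return 1;
  if (nbytes >= COPY_ALL_THREADS_BYTES) return max_threads;

  /* In between, scale the thread count linearly with the size in MB.
   * This ramp is an assumption, not something the plots establish. */
  const size_t mbytes = nbytes >> 20;
  const size_t all_mbytes = COPY_ALL_THREADS_BYTES >> 20;
  const int n = (int)((size_t)max_threads * mbytes / all_mbytes);
  return (n < 2) ? 2 : n;
}
```

On a 128-thread node this gives 1 thread for a 1MB copy, 64 threads for 512MB and all 128 from 1GB upwards.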

Here is the same plot with a log Y axis as well.

This clearly makes the point that using all the threads for smaller copies also anti-scales, so it is a very bad idea.

Raw data: explore.log-full-post

Similar fun for COSMA7:

The bottom axis is slightly different this time: instead of GB it is in units of 4096 bytes, so one page. This is the size at which things like interleaving happen. I've also highlighted a single thread and 28 threads to guide the eye more easily. The size of the circles is proportional to the number of threads as before.

Quite interesting how one thread can be faster and then suddenly slower just after 10 pages. I think this is the gap where we transition to having at least one page per thread, so 28 pages. After that multiple threads can win. The difference seems to be larger than on COSMA8 (although I may update that plot), so after this threshold you may as well use all the threads, it seems.
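If that reading is right, the COSMA7 rule of thumb is very simple. A minimal sketch, assuming a 4096-byte page and using the hypothetical name `copy_threads_by_pages`: stay on one thread until there is at least one whole page per thread, then use them all.

```c
#include <assert.h>
#include <stddef.h>

#define COPY_PAGE_BYTES 4096 /* page size assumed in the plots */

/* Use a single thread until the copy spans at least one whole page per
 * thread, then use all max_threads threads. */
int copy_threads_by_pages(size_t nbytes, int max_threads) {
  assert(max_threads >= 1);
  const size_t pages = nbytes / COPY_PAGE_BYTES;
  return (pages >= (size_t)max_threads) ? max_threads : 1;
}
```

With 28 threads this switches from one thread to all 28 at 28 pages, i.e. 112KB. Whether the switch point really sits exactly there would need checking against the raw data.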

Now if we also log the Y axis: