Draft: Implement a memcpy() clone that uses the threadpool
Using multiple threads to copy data can make better use of the available memory bandwidth when multiple NUMA regions are being used. This is an attempt to exploit that.
Experiments show that it only pays off for quite large copies, so a heuristic requires a minimum number of 4k pages before the threaded copy is used. At least one page seems to be required, which makes sense, but in practice more are needed (COSMA8, AMD 128 cores, 8 NUMA regions). Surprisingly, even then using all the threads is not guaranteed to give the best speed, so we also limit the number of threads used to 25% of the total. Using more than one thread gives a good improvement, which tails off rapidly.
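As a rough illustration of the heuristic described above, here is a minimal sketch and not the code in this branch. The names choose_copy_threads, COPY_PAGE_SIZE and min_pages_per_thread are hypothetical, and treating the page threshold as a per-thread minimum is an assumption; the actual threshold value is not stated here, so it is left as a parameter.

```c
#include <stddef.h>

/* Assumed page size of 4k, as used in the description. */
#define COPY_PAGE_SIZE 4096

/* Hypothetical helper: pick how many threads to use for a copy of nbytes.
 * Returns 1 (plain memcpy) when the copy is too small to be worth splitting.
 * min_pages_per_thread is an assumed tunable, not a value from this MR. */
static int choose_copy_threads(size_t nbytes, int nthreads_total,
                               size_t min_pages_per_thread) {
  if (nthreads_total < 1 || min_pages_per_thread < 1) return 1;

  /* Never use more than 25% of the available threads, but at least one. */
  int max_threads = nthreads_total / 4;
  if (max_threads < 1) max_threads = 1;

  /* Number of threads that can each be given the minimum number of pages. */
  size_t npages = nbytes / COPY_PAGE_SIZE;
  size_t usable = npages / min_pages_per_thread;

  if (usable < 2) return 1;
  if (usable < (size_t)max_threads) return (int)usable;
  return max_threads;
}
```

With 128 threads this caps a large copy at 32 threads and falls back to a single thread when fewer than two threads could be given the minimum number of pages.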
The actual gains here are modest. The largest come from particle splitting, but only when the buffers need reallocating, and from replication, which is only used for testing.
So the question is, is this worth it...
Activity
assigned to @pdraper
added 6 commits
- 6de3e3ad...e876eb0f - 5 commits from branch master
- b037e0de - Merge remote-tracking branch 'origin/master' into threadpool_memcpy
@matthieu @dc-rope1 assuming this worked (!), that step, 28, dropped from 67s to 38s. Still twice the time taken by that MPI rank. That is when I use it to replace the memcpy()s in engine_split_particles().
Edited by Peter W. Draper
added 11 commits
- b037e0de...c2b024a3 - 9 commits from branch master
- 3361db51 - Add threadpool mappers for scheduler_reweight
- d35e06f8 - Merge remote-tracking branch 'origin/master' into threadpool_memcpy
added 1 commit
- 5684af6c - Revert "Add threadpool mappers for scheduler_reweight"
added 1 commit
- 35ee0ef3 - Need to protect against owner for cj also being -1
@dc-rope1 you might be interested in this as well.
Tried this out in various parts of SWIFT with very mixed results, so wanted to see if any obvious heuristics could be used to work out when it is worth using. To that end I wrote a simple program that just did the memcpy using various numbers of threads and data sizes (the data is spread over all the threads, rather than copying a fixed size per thread, as that is typical for SWIFT). Here are some interesting plots that show how these two variables work out on COSMA8 (AMD 128 cores):
What we are seeing here is the data copied (logged) in GB versus the data rate of the copy. Each discrete size is copied a number of times, using from 1 to 128 threads (steps of 2). The number of threads used is shown by both the colour scale and the size of the marker. Some scaled markers are too small to be seen, so they have a fixed-size green marker as well.
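For reference, the kind of split copy being timed can be sketched as below. This is a minimal illustration using plain pthreads as a stand-in for the SWIFT threadpool, not the actual test program or the code in this branch; threaded_memcpy, struct copy_chunk and copy_chunk_run are hypothetical names. The total size is divided across the threads, rather than each thread copying a fixed size.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One contiguous piece of the copy, handled by a single thread. */
struct copy_chunk {
  char *dest;
  const char *src;
  size_t nbytes;
};

static void *copy_chunk_run(void *arg) {
  struct copy_chunk *c = (struct copy_chunk *)arg;
  memcpy(c->dest, c->src, c->nbytes);
  return NULL;
}

/* Hypothetical helper: copy nbytes from src to dest using nthreads threads,
 * each taking an equal share (the last one also takes the remainder).
 * Falls back to a plain memcpy whenever threading is not possible. */
static void threaded_memcpy(void *dest, const void *src, size_t nbytes,
                            int nthreads) {
  if (nthreads < 2 || nbytes < (size_t)nthreads) {
    memcpy(dest, src, nbytes);
    return;
  }

  pthread_t *threads = malloc((size_t)nthreads * sizeof(*threads));
  struct copy_chunk *chunks = malloc((size_t)nthreads * sizeof(*chunks));
  if (threads == NULL || chunks == NULL) {
    free(threads);
    free(chunks);
    memcpy(dest, src, nbytes);
    return;
  }

  const size_t chunk = nbytes / (size_t)nthreads;
  int started = 0;
  for (int i = 0; i < nthreads; i++) {
    const size_t offset = (size_t)i * chunk;
    chunks[i].dest = (char *)dest + offset;
    chunks[i].src = (const char *)src + offset;
    chunks[i].nbytes = (i == nthreads - 1) ? nbytes - offset : chunk;
    if (pthread_create(&threads[i], NULL, copy_chunk_run, &chunks[i]) != 0) {
      /* Could not start a thread: copy everything that is left here. */
      memcpy(chunks[i].dest, chunks[i].src, nbytes - offset);
      break;
    }
    started++;
  }

  for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
  free(threads);
  free(chunks);
}
```

Timing calls such as threaded_memcpy(dest, src, nbytes, nthreads) over a grid of sizes and thread counts would give data of the kind plotted here. Note that this sketch creates threads on every call, which a threadpool avoids, so its absolute timings are not comparable.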
So we can see that it is not worth using a threaded copy until we get to quite large sizes, and should only use all 128 threads for sizes greater than 1GB. Below that fewer threads are always better.
Here is the same plot with a log Y axis as well.
Clearly makes the point that using all the threads for smaller copies also anti-scales, so is a very bad idea.
Raw data: explore.log-full-post
Similar fun for COSMA7:
The bottom axis is slightly different this time: instead of GB it is in units of 4096 bytes, so a page. This is the size at which things like interleaving happen. I've also highlighted a single thread and 28 threads to guide the eye more easily. The size of the circles is proportional to the number of threads as before.
Quite interesting how one thread can be faster and then suddenly slower just after 10 pages. I think this is the gap where we transition to having at least one page per thread, so 28. After that multiple threads can win. The difference seems to be greater than on COSMA8 (although I may update that plot), so after this threshold you may as well use all the threads, it seems.
Now if we also log the Y axis:
Pretty clear that multiple threads accessing data at these scales is a bad idea. Could be part of why the threadpool behaves so oddly at times.
Raw data:
added 58 commits
- f6353070...231cdd06 - 57 commits from branch master
- 37a823f4 - Merge branch 'master' into 'threadpool_memcpy'
added 1 commit
- 97157b93 - Fix up bad master merge with removed functions
added Test feature request performance labels