Draft: Implement a memcpy() clone that uses the threadpool
Using multiple threads to copy data can make better use of the available memory bandwidth when multiple NUMA regions are being used. This is an attempt to exploit that.
Experiments show that it only pays off for quite large copies, so a heuristic is in place that requires a minimum number of 4k pages before the threaded copy is used. At least 1 page seems to be required, which makes sense, but more are needed in practice (COSMA8, AMD, 128 cores, 8 NUMA regions). Surprisingly, even then using all the threads does not guarantee the best speed, so we also limit the number of threads used to 25% of the total. Using more than 1 thread gives a good improvement, which tails off rapidly.
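To make the heuristic concrete, here is a minimal sketch of the idea. It is not the code in this MR: it uses plain pthreads instead of the threadpool, and the names `threaded_memcpy`, `copy_chunk`, `MIN_PAGES` and `MAX_COPY_THREADS` are hypothetical. The `MIN_PAGES` value in particular is a placeholder, not a measured threshold. Only the 4k page size and the 25% thread limit come from the description above.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096 /* 4k pages, as in the description. */
#define MIN_PAGES 8          /* Hypothetical threshold, tune per machine. */
#define MAX_COPY_THREADS 64  /* Hypothetical hard cap for this sketch. */

/* One contiguous piece of the copy, handled by one thread. */
struct copy_chunk {
  char *dest;
  const char *src;
  size_t n;
};

static void *copy_chunk_mapper(void *arg) {
  struct copy_chunk *c = (struct copy_chunk *)arg;
  memcpy(c->dest, c->src, c->n);
  return NULL;
}

/* Copy n bytes from src to dest. total_threads is the size of the pool
 * that would normally be available; at most 25% of it is used. */
void *threaded_memcpy(void *dest, const void *src, size_t n,
                      int total_threads) {
  int nthreads = total_threads / 4;
  if (nthreads > MAX_COPY_THREADS) nthreads = MAX_COPY_THREADS;

  /* Small copies, or too few threads: a plain memcpy is faster. */
  if (n < (size_t)MIN_PAGES * PAGE_SIZE_BYTES || nthreads < 2)
    return memcpy(dest, src, n);

  /* Split into chunks of whole pages, at most one chunk per thread. */
  size_t npages = (n + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
  size_t pages_per_thread = (npages + (size_t)nthreads - 1) / (size_t)nthreads;
  size_t chunk = pages_per_thread * PAGE_SIZE_BYTES;

  pthread_t threads[MAX_COPY_THREADS];
  struct copy_chunk chunks[MAX_COPY_THREADS];
  int started = 0;
  for (size_t offset = 0; offset < n; offset += chunk) {
    struct copy_chunk *c = &chunks[started];
    c->dest = (char *)dest + offset;
    c->src = (const char *)src + offset;
    c->n = (n - offset < chunk) ? n - offset : chunk;
    if (pthread_create(&threads[started], NULL, copy_chunk_mapper, c) != 0) {
      /* Could not start a thread: copy this chunk in the caller. */
      memcpy(c->dest, c->src, c->n);
      continue;
    }
    started++;
  }

  /* Wait for all the chunks to complete. */
  for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
  return dest;
}
```

Creating threads on every call adds overhead that a persistent threadpool avoids, so this sketch would need a larger minimum size than a pool-based version to break even.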
The actual gains here are modest. The largest come from particle splitting, but only when the buffers need reallocating, and from replication, which is only used for testing.
So the question is, is this worth it...