Better NUMA awareness
As discussed in today's meeting, we may want to try any of the following:
- Allocate memory on a specific node: this is possible using the libnuma function `void *numa_alloc_onnode(size_t size, int node)`.
- Allocate memory interleaved over a set of nodes: this is also possible, with `void *numa_alloc_interleaved_subset(size_t size, struct bitmask *nodemask)`. The allocated memory will be interleaved over the selected nodes page-by-page, which is probably not really what we want, since each page is only a few kB (`getconf PAGESIZE` will give you the value on your machine). I haven't been able to find a more flexible function within libnuma.
- Get the distance between a CPU and a page of memory: this can also be done with libnuma, using a combination of `move_pages` or `get_mempolicy`, which return the node of a given page address; `int numa_node_of_cpu(int cpu)`, which returns the NUMA node ID of a given CPU; and finally `int numa_distance(int node1, int node2)`, which computes the distance between the page's and the CPU's NUMA node IDs.
So I think libnuma can satisfy most of our needs. The thing I'm missing is a `numa_alloc_interleaved` function that doesn't work page-by-page, but in larger chunks. Apparently we can fake that by `mmap`ping the memory and assigning it to nodes ourselves, but that's a somewhat big step.
What we could try for now is the following:
- Allocate the particle arrays with `numa_alloc_interleaved_subset`, with a bitmask of the nodes used, without any control over assigning tasks to specific CPUs depending on the data location. This may already bring an improvement if the current situation is that all memory is allocated on a single NUMA node and the bottleneck is that node's bandwidth. If the memory is spread out, each read from a distant CPU may take longer, but we won't choke on bandwidth at the NUMA-node level. It could also be, though, that this is already the default behaviour, in which case we won't see a difference.
- Allocate work arrays/buffers locally with `numa_alloc_onnode`. This is slower than a regular `malloc`, but may pay off in terms of faster memory access. @jwillis, is this what you had already tried?
Both of these seem relatively simple to implement. Although they're not exactly what we want, they may be a step in the right direction!