
Better NUMA awareness

As discussed in today's meeting, we may want to try any of the following:

  • Allocate memory on a specific node: This is possible using the libnuma function void *numa_alloc_onnode(size_t size, int node).
  • Allocate memory interleaved over a set of nodes: This is also possible with void *numa_alloc_interleaved_subset(size_t size, struct bitmask *nodemask). The allocated memory will be interleaved over the selected nodes page-by-page, which is probably not really what we want since each page is only a few kB (getconf PAGESIZE will give you the value on your machine). I haven't been able to find a more flexible function within libnuma.
  • Get the distance between the CPU and a page of memory: This can also be done with libnuma, by combining either move_pages or get_mempolicy (both can report the node a given page address currently resides on) with int numa_node_of_cpu(int cpu), which returns the NUMA node ID of a given CPU, and int numa_distance(int node1, int node2), which returns the distance between the two node IDs (see the sketch after this list).
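For reference, here is a minimal sketch of that distance query, using the numa_move_pages wrapper in query mode (a NULL target-node array moves nothing and just reports the node each page currently sits on). The helper name, array and sizes are placeholders, and it needs linking with -lnuma:

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Distance between the NUMA node of the CPU we are currently running on
 * and the node holding the page that contains `addr`. Placeholder helper. */
static int distance_to_page(void *addr)
{
    /* Round the address down to its page boundary. */
    const uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)addr & ~(page_size - 1));

    /* With a NULL node array, numa_move_pages() moves nothing and instead
     * writes the current node of each page into `status`. */
    int status = -1;
    if (numa_move_pages(0 /* this process */, 1, &page, NULL, &status, 0) != 0 ||
        status < 0)
        return -1; /* page not yet touched/allocated, or the query failed */

    const int cpu_node = numa_node_of_cpu(sched_getcpu());
    return numa_distance(cpu_node, status); /* 10 = local, larger = further away */
}

int main(void)
{
    if (numa_available() < 0)
        return 1; /* no NUMA support on this machine */

    double *parts = numa_alloc_onnode(1000 * sizeof(*parts), 0);
    parts[0] = 1.0; /* touch the page so it actually gets placed */
    printf("distance to parts[0]: %d\n", distance_to_page(parts));
    numa_free(parts, 1000 * sizeof(*parts));
    return 0;
}
```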

So I think libnuma can satisfy most of our needs. The thing I'm missing is a numa_alloc_interleaved function that doesn't work page-by-page, but in larger chunks. Apparently we can fake that by mmapping the memory and assigning it to nodes ourselves, but that's a somewhat big step.
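To make the size of that step a bit more concrete, the do-it-ourselves route could look roughly like the sketch below: mmap an anonymous region and bind successive large chunks to the selected nodes with numa_tonode_memory before the memory is first touched. The helper name, chunk size and node list are placeholders; this is only meant to illustrate the extra bookkeeping we would be taking on.

```c
#include <numa.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Interleave `size` bytes over the given nodes in chunks of `chunk` bytes
 * instead of single pages. Illustration only; release with munmap(). */
static void *alloc_interleaved_chunks(size_t size, size_t chunk,
                                      const int *nodes, int nnodes)
{
    /* The binding works on whole pages, so round the chunk size up. */
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    chunk = (chunk + page - 1) & ~(page - 1);

    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    /* Assign the chunks to the nodes round-robin. The policy takes effect
     * when the pages are first touched, so do this before using the memory. */
    char *p = mem;
    for (size_t off = 0, i = 0; off < size; off += chunk, ++i) {
        const size_t len = (off + chunk <= size) ? chunk : size - off;
        numa_tonode_memory(p + off, len, nodes[i % nnodes]);
    }
    return mem;
}
```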

What we could try for now is the following:

  • Allocate the particle arrays with numa_alloc_interleaved_subset, passing a bitmask of the NUMA nodes hosting the CPUs we use, without any control over assigning tasks to specific CPUs depending on where the data lives. This may already bring an improvement if the current situation is that all memory gets allocated on a single NUMA node and the bottleneck is that node's bandwidth. With the memory spread out, each read from a distant CPU may take longer, but we won't choke on the bandwidth of a single NUMA node. It could also be that this is already the default behaviour, in which case we won't see a difference. (A sketch follows after this list.)

  • Allocate work arrays/buffers locally with numa_alloc_onnode. This is slower than a regular malloc, but may pay off through faster memory access afterwards (also shown in the sketch below). @jwillis, is this what you had already tried?
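For what it's worth, a minimal sketch of both steps, with placeholder sizes and a placeholder node set (in practice the set would be built from the CPUs our threads actually run on, e.g. via numa_node_of_cpu on each CPU in the affinity mask):

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stddef.h>

int main(void)
{
    if (numa_available() < 0)
        return 1; /* no NUMA support on this machine */

    /* Step 1: particle arrays interleaved over a subset of nodes.
     * Placeholder: use the first two nodes (or fewer on a small machine). */
    struct bitmask *nodes = numa_allocate_nodemask();
    for (int n = 0; n <= numa_max_node() && n < 2; ++n)
        numa_bitmask_setbit(nodes, n);

    const size_t nparts = 1 << 20; /* placeholder array size */
    double *parts = numa_alloc_interleaved_subset(nparts * sizeof(*parts), nodes);
    numa_bitmask_free(nodes);

    /* Step 2: work buffers allocated on the node local to the calling thread
     * (each thread would do this for its own buffer). */
    const int here = numa_node_of_cpu(sched_getcpu());
    double *buffer = numa_alloc_onnode(4096 * sizeof(*buffer), here);

    /* ... do the actual work ... */

    numa_free(buffer, 4096 * sizeof(*buffer));
    numa_free(parts, nparts * sizeof(*parts));
    return 0;
}
```

Note that memory from the numa_alloc_* family has to be released with numa_free rather than free.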

Both of these seem relatively simple to implement. Although they're not exactly what we want, they may be a step in the right direction!
