Idea: Manual page management for NUMA improvement
Open issue created by Aidan Chalk

    So this is mostly hypothesised and may work badly, but bear with me. We currently have a strategy for data reuse (and hopefully some NUMA benefit) where we look for tasks that overlap with previously executed tasks, but work on SWIFT suggests this does not shore up many of the code's NUMA weaknesses.

    Since Linux uses a first-touch page-placement policy, and most of our examples/likely use cases allocate memory from the master thread, all of the pages associated with the work (depending on problem size) may end up on NUMA node 0, meaning all the threads on NUMA node 1 have longer access times to memory.
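    A quick way to see where first touch actually put a page is to call move_pages(2) in query mode; this sketch uses the raw syscall so it needs no libnuma (the allocation and messages are purely illustrative):

    ```c
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
      long page_size = sysconf(_SC_PAGESIZE);
      char *buf = aligned_alloc(page_size, page_size);
      /* First touch: writing the page from this thread places it on
         this thread's NUMA node under the default policy. */
      memset(buf, 1, page_size);
      void *pages[1] = { buf };
      int status[1] = { -1 };
      /* nodes == NULL puts move_pages(2) in query mode: it writes each
         page's current NUMA node into status instead of moving anything. */
      long rc = syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
      if (rc == 0)
        printf("page is on NUMA node %d\n", status[0]);
      else
        printf("move_pages failed\n");
      free(buf);
      return 0;
    }
    ```

    On a single-node machine this just reports node 0, but it is enough to confirm the first-touch placement described above.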

    Since many of the branches have a void* data pointer associated with each resource, at qsched_run time we can find the memory page(s) spanned by each resource and, from that, the NUMA node currently holding each page. (I thought this would be the yucky bit, involving kernel-level stuff from page.h such as #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT), but virt_to_page is kernel-only; move_pages from libnuma takes ordinary user-space virtual addresses, so page-aligning the resource pointer should be enough.) We can then manually balance the resources across the NUMA nodes using move_pages, so that each resource is explicitly assigned to a node.

    Each thread should also be pinned to a core, and each core has an associated NUMA node, giving each queue an associated NUMA node. We then initially place tasks in the "best" queue, i.e. the one whose NUMA node holds the most of the task's memory (on a tie, fall back to a lock/uses comparison, or just pick at random). When stealing work, threads should first look at queues on their own NUMA node before trying queues on other NUMA nodes.
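    The same-node-first stealing order could be sketched like this (NR_QUEUES, queue_node and steal_order are made-up names, not existing QuickSched code; a real version would derive the queue-to-node mapping from the pinning):

    ```c
    #include <stdio.h>

    #define NR_QUEUES 4
    /* Hypothetical mapping from queue index to NUMA node. */
    static const int queue_node[NR_QUEUES] = { 0, 0, 1, 1 };

    /* Fill order[] with queue indices for a thread on my_node:
       same-node queues first, then the remote ones. */
    static void steal_order(int my_node, int order[NR_QUEUES]) {
      int k = 0;
      for (int q = 0; q < NR_QUEUES; q++)
        if (queue_node[q] == my_node) order[k++] = q;
      for (int q = 0; q < NR_QUEUES; q++)
        if (queue_node[q] != my_node) order[k++] = q;
    }

    int main(void) {
      int order[NR_QUEUES];
      steal_order(1, order);
      for (int q = 0; q < NR_QUEUES; q++) printf("%d ", order[q]);
      printf("\n"); /* a thread on node 1 scans queues 2 3 0 1 */
      return 0;
    }
    ```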

    This works fairly straightforwardly for "standard" CPUs, where libnuma is available (it would become required) and where you have 2 NUMA nodes, with each core at distance 10 to one node and >10 to the other. We can also use this information to decide how to set thread affinity. We also have to be aware that small resources may share a page, so many resources can map to the same page; we need to keep track of pages we've already seen, likely with a hashmap or similar.
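    The page-tracking could be as simple as an open-addressed set keyed on the page's base address; a sketch (the table size, hash and names are all illustrative, and it assumes 4 KiB pages):

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SEEN_SLOTS 1024 /* must be a power of two; illustrative size */
    static uintptr_t seen[SEEN_SLOTS];

    /* Record a page address; return 1 if it was already seen, 0 otherwise.
       addr must be the page's base address (aligned down to the page size). */
    static int page_seen(uintptr_t addr) {
      uintptr_t h = (addr >> 12) & (SEEN_SLOTS - 1); /* assumes 4 KiB pages */
      while (seen[h] != 0) {
        if (seen[h] == addr) return 1;
        h = (h + 1) & (SEEN_SLOTS - 1); /* linear probing */
      }
      seen[h] = addr;
      return 0;
    }

    int main(void) {
      memset(seen, 0, sizeof(seen));
      int a = page_seen(0x1000); /* new page      -> 0 */
      int b = page_seen(0x2000); /* new page      -> 0 */
      int c = page_seen(0x1000); /* repeated page -> 1 */
      printf("%d %d %d\n", a, b, c);
      return 0;
    }
    ```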

    For a given resource with pointer void *data and int size, we can probably do the following (naively, ignoring repeated pages):

    
    const int page_size = numa_pagesize();
    /* move_pages() takes page-aligned user-space addresses, so align the
       start of the resource down to a page boundary... */
    char *cdata = (char *)((uintptr_t)data & ~(uintptr_t)(page_size - 1));
    /* ...and count exactly how many pages [data, data + size) spans. */
    int num_pages = (int)(((uintptr_t)data + size - (uintptr_t)cdata + page_size - 1) / page_size);
    void **pages = malloc(sizeof(void *) * num_pages);
    for (int i = 0; i < num_pages; i++) {
      pages[i] = cdata + i * page_size;
    }
    int numas[num_pages];
    /* With nodes == NULL, move_pages() moves nothing and instead writes the
       NUMA node currently holding each page into numas. */
    long error = move_pages(0, (unsigned long)num_pages, pages, NULL, numas, 0);
    free(pages);
    

    Function references:
    https://github.com/torvalds/linux/blob/ead751507de86d90fa250431e9990a8b881f713c/arch/x86/include/asm/page.h
    https://linux.die.net/man/2/move_pages
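    The pinning side is standard pthreads; a sketch (the core choice is illustrative, and mapping the core to its NUMA node would use numa_node_of_cpu() from libnuma, omitted here to keep the sketch dependency-free):

    ```c
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single core.  The core's NUMA node could
       then be looked up with numa_node_of_cpu() from libnuma, which would
       give the thread's queue its associated node. */
    static int pin_to_core(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      return pthread_setaffinity_np(pthread_self(), &set, sizeof(set));
    }

    int main(void) {
      if (pin_to_core(0) != 0) {
        printf("pinning failed\n");
        return 0;
      }
      printf("running on core %d\n", sched_getcpu());
      return 0;
    }
    ```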
