QuickSched issues: https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues

Issue #18: Idea: Manual page management for NUMA improvement
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/18
2018-03-06, Aidan Chalk

So this is mostly hypothesized and may work badly, but we currently have a strategy for data reuse (and hopefully some NUMA benefits) where we look for tasks that overlap with previously used tasks; however, work on SWIFT shows this may not shore up many of the code's weaknesses with NUMA.
Since Linux uses a first-touch policy, and most of our examples and likely use cases allocate memory from the master thread, all of the pages associated with the work (depending on problem size) may be stuck on NUMA node 0, meaning all the threads on NUMA node 1 have longer access times to memory.
Since many of the branches have a `void* data` pointer associated with each resource, at `qsched_run` time we can find out the memory page(s) associated with each resource (this is the yucky bit involving Linux kernel-level stuff from page.h, with `#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)`; note that from user space, `move_pages` from libnuma already accepts page-aligned virtual addresses directly) and, from that, the NUMA node that contains each page. We can then manually balance the resources across NUMA nodes using `move_pages`, allowing each resource to be allocated to a NUMA node. Each thread should also be pinned to a core, which has a NUMA node associated with it, giving each queue an associated NUMA node. We then initially place tasks in the queue that is "best", i.e. the one whose NUMA node holds the most of the task's memory (if equal, do something with a lock/uses comparison, or just pick at random). When stealing work, threads should first look in queues on the same NUMA node before looking at queues on other NUMA nodes.
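The initial "best queue" placement rule above could be sketched as follows. This is a hypothetical helper, not existing QuickSched API: `bytes_on_node[n]` would be the number of the task's resource bytes resident on NUMA node `n`, and `queue_node[q]` the node that queue `q`'s thread is pinned to.

```c
#include <stdlib.h>

/* Pick the queue whose NUMA node holds the most of the task's data;
   break ties uniformly at random (reservoir-style). Sketch only. */
int best_queue(const long *bytes_on_node, const int *queue_node,
               int nr_queues) {
  long best_bytes = -1;
  int best = 0, nties = 0;
  for (int q = 0; q < nr_queues; q++) {
    long b = bytes_on_node[queue_node[q]];
    if (b > best_bytes) {
      best_bytes = b;
      best = q;
      nties = 1;
    } else if (b == best_bytes && rand() % ++nties == 0) {
      best = q; /* each tied queue is kept with probability 1/nties */
    }
  }
  return best;
}
```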
This works fairly straightforwardly for "standard" CPUs, where libnuma is available (it would become required) and there are 2 NUMA nodes, with each core at distance 10 to one node and >10 to the other. We can also use this to decide how to do affinity. We also have to be aware that for small resources, many resources will share the same page, so we need to keep track of pages we have already seen, likely with a hashmap or similar.
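The page tracking could be a small open-addressing set keyed on page base addresses; a minimal sketch with made-up names (fixed capacity, no resizing):

```c
#include <stdint.h>
#include <stddef.h>

#define SEEN_CAP 1024 /* power of two; sketch only, no resizing */

/* Open-addressing set of page base addresses (0 = empty slot).
   Returns 1 if the page was already present, 0 if newly inserted. */
int page_seen(uintptr_t *table, uintptr_t addr, uintptr_t page_size) {
  uintptr_t page = addr & ~(page_size - 1);
  size_t h = (size_t)((page >> 12) * 2654435761u) & (SEEN_CAP - 1);
  while (table[h] != 0) {
    if (table[h] == page) return 1; /* already counted this page */
    h = (h + 1) & (SEEN_CAP - 1);   /* linear probing */
  }
  table[h] = page;
  return 0;
}
```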
For a given resource with pointer `void *data` and `int size`, we can probably do (in a naive way ignoring repeated pages):
```C
/* Assumes <numa.h> (numa_pagesize), <numaif.h> (move_pages, MPOL_MF_MOVE),
   <stdint.h> and <stdlib.h>. Note: virt_to_page is kernel-only; from user
   space, move_pages takes page-aligned user virtual addresses directly. */
const uintptr_t page_size = (uintptr_t)numa_pagesize();
const uintptr_t first = (uintptr_t)data & ~(page_size - 1);
const uintptr_t last = ((uintptr_t)data + size - 1) & ~(page_size - 1);
/* Exact count; handles an unaligned data pointer and sizes that are
   exact multiples of page_size. */
const long num_pages = (long)((last - first) / page_size) + 1;
void **pages = malloc(sizeof(void *) * num_pages);
for (long i = 0; i < num_pages; i++)
  pages[i] = (void *)(first + i * page_size);
int numas[num_pages];
/* With nodes == NULL, move_pages only queries the node of each page. */
long error = move_pages(0, (unsigned long)num_pages, pages, NULL, numas,
                        MPOL_MF_MOVE);
```
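The page-count arithmetic has two corner cases (a size that is an exact multiple of the page size, and a `data` pointer that is not page-aligned); it can be checked in isolation with a small helper (hypothetical name):

```c
#include <stdint.h>
#include <stddef.h>

/* Number of pages the buffer [addr, addr + size) touches. */
long pages_spanned(uintptr_t addr, size_t size, uintptr_t page_size) {
  if (size == 0) return 0;
  uintptr_t first = addr & ~(page_size - 1);
  uintptr_t last = (addr + size - 1) & ~(page_size - 1);
  return (long)((last - first) / page_size) + 1;
}
```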
Function references:
https://github.com/torvalds/linux/blob/ead751507de86d90fa250431e9990a8b881f713c/arch/x86/include/asm/page.h
https://linux.die.net/man/2/move_pages

Issue #16: OpenMP lock rather than gnu99 atomics
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/16
2017-11-09, Aidan Chalk

Should we use OpenMP locks when `HAVE_OPENMP` is defined, rather than gnu99 atomics and writing our own (when not using pthread locks)?
I have a draft of this but no proof yet that it works well.

Issue #15: Add Task Timers information to the code if a define is met. Add to configure
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/15
2017-11-09, Aidan Chalk

We use task plots a lot; it should be easy to allow QuickSched to collect this data when we want it.
Should be enabled with something like `./configure --enable-task-timers` or similar.
Issue #14: Improve work-stealing scheme
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/14
2017-11-09, Aidan Chalk

The work done in Munich on ls1-mardyn highlights a major issue at high thread counts: if only a small number of queues have work, a lot of time is spent accessing random empty queues and making no progress.
I'm unclear as to what exactly is occurring but my guess would be something to do with:
```c
for ( naq = 0 , k = 0 ; k < s->nr_queues ; k++ )
if ( k != qid && s->queues[k].count > 0 )
qids[ naq++ ] = k;
```
is executed multiple times, as we don't distinguish between empty queues and queues that have work but none that we could lock at this time.
I also wonder: if 60 threads have found the same 6 queues containing work, what percentage of them fail at
`tid = queue_get( &s->queues[ qids[k] ] , s , 0 );` and then mark the queue empty when they shouldn't? If they do this on all of those queues, they will search their own queue, rebuild the non-empty list and only then search for work again (with OpenMP this is very bad, as there is no yielding; the threads just batter the queues until they find work).
I think we might want to change this
```c
while ( naq > 0 ) {
k = rand() % naq;
TIMER_TIC2
tid = queue_get( &s->queues[ qids[k] ] , s , 0 );
TIMER_TOC( s , qsched_timer_queue )
if ( tid < 0 )
qids[k] = qids[ --naq ];
else
break;
}
```
to:
```c
while ( naq > 0 ) {
k = rand() % naq;
TIMER_TIC2
tid = queue_get( &s->queues[ qids[k] ] , s , 0 );
TIMER_TOC( s , qsched_timer_queue )
if ( tid < 0 && s->queues[qids[k]].count == 0)
qids[k] = qids[ --naq ];
else if (tid >= 0)
break;
}
```
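One way to address the worry that the compiler may cache `count`: with C11, read it through an explicit atomic load, which forces a fresh load from memory on every retry. A minimal sketch with a simplified queue struct (not the real one):

```c
#include <stdatomic.h>

struct queue_sketch {
  _Atomic int count; /* number of tasks currently in the queue */
};

/* Fresh load from memory on every call; the compiler cannot keep the
   value in a register across iterations of the stealing loop. */
static int queue_count(struct queue_sketch *q) {
  return atomic_load_explicit(&q->count, memory_order_relaxed);
}
```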
I don't think this change will necessarily do anything until we can use atomic_read with C11 (I couldn't work out what needs changing in the autotools setup to switch from gnu99 to c11 or gnu11(?)), as the compiler may otherwise just cache the count variable and we would get stuck.

Issue #13: No way to get thread ID from QuickSched
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/13
2017-11-09, Aidan Chalk

This is needed if you have a variable that is summed across the entire computation (for example, energy in MD), to accumulate values in a thread-private store that is then reduced globally.

Issue #12: C11 compliance
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/12
2017-11-09, Aidan Chalk

Move from volatile to atomic_read etc.

Issue #11: QuickSched Paper
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/11
2017-11-09, Aidan Chalk

I will try to get the data plots required for the QuickSched paper soon: if someone else is publishing using our library, it would help everyone out if we at least had a final preprint.
@matthieu if you want to set a specific deadline for me to aim for, feel free to set a milestone and I'll get it done by then. I just need to not forget, basically.
(Assignee: Aidan Chalk)

Issue #10: Work at enqueue for tasks (aimed at communication)
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/10
2017-11-09, Aidan Chalk

So from discussions I had at the IXPUG meeting, one issue people raised with me was that they couldn't get MPI to work how they would like in whatever task implementation they were using.
I think extending QuickSched's model to support a function call at enqueue (specified by the user; things such as MPI_Isend, for example) might let us test this out, and push for it in whatever the task standard ends up being. We could also have a custom trylock function specific to a task type.
Something like
`qsched_add_enqueue_function( function_pointer);`
`qsched_add_lock_function( function_pointer, int task_type);`
to be called by the user, and enabled or disabled at library compilation time (we could start our own sequence of difficult-to-remember defines, as other major libraries have...), via something like `./configure --with-enqueue-funcs`. It would be restricted to one enqueue function for all tasks, and that function would be similar in signature to the one provided to `qsched_run`.
I don't think we need a `dequeue` function or `unlock` function at this time. This lets people do communication as follows, e.g. for a send task:
```c
qsched_addtask(..., task_type_send, ...);
qsched_add_enqueue_function(&enqueue_func);
qsched_add_lock_function(&send_lock_func, task_type_send);
...
void enqueue_func(int type, int *data) {
  if (type == task_type_send) {
    MPI_Isend(...);
  }
}
...
int send_lock_func(int *data) {
  return MPI_Test(...);
}
```
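On the back end, the registration calls above could boil down to storing function pointers on the scheduler, with a per-type table for the lock functions. A sketch under assumed names; none of this is existing QuickSched API, and the example hooks just stand in for the MPI calls:

```c
#include <stddef.h>

#define QSCHED_MAX_TYPES 64 /* assumed bound on the number of task types */

typedef void (*enqueue_fn)(int type, void *data);
typedef int (*lock_fn)(void *data); /* nonzero when the task may run */

struct sched_hooks {
  enqueue_fn on_enqueue;           /* single function for all task types */
  lock_fn locks[QSCHED_MAX_TYPES]; /* per-type trylock, NULL = default */
};

/* Would be called by the scheduler when a task first becomes runnable. */
static void hooks_enqueue(struct sched_hooks *h, int type, void *data) {
  if (h->on_enqueue != NULL) h->on_enqueue(type, data);
}

/* Per-type trylock: defer to a registered function, else always succeed. */
static int hooks_trylock(struct sched_hooks *h, int type, void *data) {
  if (type >= 0 && type < QSCHED_MAX_TYPES && h->locks[type] != NULL)
    return h->locks[type](data);
  return 1;
}

/* Example hooks in the spirit of the issue's send task. */
static int example_enqueues = 0;
static void example_enqueue(int type, void *data) {
  (void)type; (void)data;
  example_enqueues++; /* a real hook would call MPI_Isend here */
}
static int example_lock(void *data) {
  (void)data;
  return 0; /* a real hook would return the flag from MPI_Test */
}
```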
Thoughts @nnrw56 @matthieu - should this be sufficient for a front-end specification? The back end for the custom lock functions may be a bit complex, and it all adds overhead that is unwanted for a basic shared-memory computation, so I think off-by-default is necessary.
(Assignee: Aidan Chalk)

Issue #8: Bug in multiple runs with the GPU version
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/8
2017-11-09, Aidan Chalk

There is some bug which makes the GPU version crash if multiple runs are specified.
(Assignee: Aidan Chalk)

Issue #2: Gravity is biased
https://gitlab.cosma.dur.ac.uk/swift/quicksched/-/issues/2
2017-11-09, Matthieu Schaller

Correct the sorted interactions to reduce the bias.
(Assignee: Matthieu Schaller)