Improve work-stealing scheme
The work done in Munich on ls1-mardyn highlights a major issue with high thread counts - if only a small number of queues have work then a lot of time is spent accessing random empty queues and making no progress.
I'm unclear as to what exactly is occurring but my guess would be something to do with:
for ( naq = 0 , k = 0 ; k < s->nr_queues ; k++ )
if ( k != qid && s->queues[k].count > 0 )
qids[ naq++ ] = k;
is executed multiple times, as we don't distinguish between empty queues and queues that have work but none that we could lock at this time.
I think also if 60 threads have found 6 queues containing work, I wonder what percentage of them fail at
tid = queue_get( &s->queues[ qids[k] ] , s , 0 );
and then mark the queue empty when they shouldn't. If they do this on all of those queues, then they will search their own queue, rebuild the non-empty list and only then search again for work (if using OpenMP this is very bad as there is no yielding, the threads just batter the queues until they find work).
I think we might want to change this
while ( naq > 0 ) {
k = rand() % naq;
TIMER_TIC2
tid = queue_get( &s->queues[ qids[k] ] , s , 0 );
TIMER_TOC( s , qsched_timer_queue )
if ( tid < 0 )
qids[k] = qids[ --naq ];
else
break;
}
}
to:
while ( naq > 0 ) {
k = rand() % naq;
TIMER_TIC2
tid = queue_get( &s->queues[ qids[k] ] , s , 0 );
TIMER_TOC( s , qsched_timer_queue )
if ( tid < 0 && s->queues[qids[k]].count == 0)
qids[k] = qids[ --naq ];
else if (tid >= 0)
break;
}
}
I don't think this change will necessarily do anything until we can use atomic_read with C11 (which I couldn't work out what needs changing in the autotools setup to switch from gnu99 to c11 or gnu11(?)), as it may just cache the count variable so get stuck.