
Resource reuse

Pedro Gonnet requested to merge resource_reuse into master (Open)

Actually also contains corrections to the paper, since I'm still too dumb to use git correctly...

Activity

  • Pedro Gonnet (Author)

    Seems to work (tested on my laptop only), but I'm not sure how much of a performance boost this gives. Will test on 64cores and cosma-f as soon as I can!

  • Pedro Gonnet added 1 commit.

  • Pedro Gonnet (Author)

    OK, tested on 64 cores, this solves all our problems. Aidan, can you have a look, just to be sure, and merge?

  • Yeah, I'll test it today/Monday and merge it.

  • I see we added a data pointer and size to qsched_addres / struct res (sketched below). Are these actually used anywhere, or are they just to make the representation clearer (and less distinct from the MPI version)?

    On the GTX690, the BH test (1M uniformly randomly distributed particles) with this version takes 1828.205 ms with 4 threads, whilst the master branch takes 1797.659 ms. With 2 threads the new version takes 3594.454 ms whilst the master branch takes 3590.812 ms. I'll test it on 64 cores after the meeting.
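
    For reference, a rough sketch of struct res with the new fields, as they show up in the GDB dumps later in this thread - the field types here are guesses from those dumps, not the library's actual header:

        #include <stddef.h>

        struct res {
          int lock;     /* non-zero while a task holds the resource.  */
          int hold;     /* hold count, per the dumps.                 */
          int owner;    /* owner id, per the dumps.                   */
          int parent;   /* index of the parent resource.              */
          void *data;   /* new field: start of the resource's memory. */
          size_t size;  /* new field: extent of that memory in bytes. */
        };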

  • Aidan Chalk added 1 commit.

  • Ok - I'm not totally convinced by this, though I don't think the error is due to these changes.

    If I run

        ./test_bh -n 1000000 -t 2

    on 64cores, the code executes successfully with no issues. However, if I run

        ./test_bh -n 1000000 -t 32

    it crashes with a Floating point exception. I'm not sure why this would occur only with more threads...

  • I just checked in GDB:

    Program received signal SIGFPE, Arithmetic exception.
    [Switching to Thread 0x7fffcb9a6700 (LWP 8281)]
    0x000000000040d1b2 in queue_task_overlap ()
    (gdb) where
    #0  0x000000000040d1b2 in queue_task_overlap ()
    #1  0x000000000040d7fd in queue_get ()
    #2  0x000000000040a5d2 in qsched_gettask ()
    #3  0x000000000040ab45 in qsched_pthread_run ()
    #4  0x0000003829e079d1 in start_thread (arg=0x7fffcb9a6700)
        at pthread_create.c:301
    #5  0x0000003829ae89dd in clone ()
        at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

    Looking at it, if

    res_union == res_isect 

    then

    return ((float)res_isect) / (res_union - res_isect);

    will divide by 0 and crash.

    I think we want the task with the most overlap with the prior task, so I think

    if (nr_res_a == 0 || nr_res_b == 0) return nr_res_a == nr_res_b ? 1.0f : 0.0f;

    should be

    if (nr_res_a == 0 || nr_res_b == 0) return 0.0f;

    and

    if (res_union == 0) return 1.0f;

    should be

    if (res_union == 0) return 0.0f;

    Finally, before

    return ((float)res_isect) / (res_union - res_isect);

    we should check (all three guards are pulled together in the sketch below)

    if (res_union == res_isect) return 1.0f;
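
    Pulling those three guards together, a minimal sketch of the guarded overlap computation (the real function is queue_task_overlap; the reduced struct, signature and loops here are illustrative assumptions, not the library's actual code):

        #include <stddef.h>

        /* Each resource is treated as a contiguous [data, data + size)
           byte range; res_union is the sum of all sizes, so the true
           union of the ranges is res_union - res_isect. */
        struct res_desc {
          char *data;
          size_t size;
        };

        static float task_overlap(const struct res_desc *res_a, int nr_res_a,
                                  const struct res_desc *res_b, int nr_res_b) {
          size_t res_union = 0, res_isect = 0;

          /* Guard 1: a task without resources overlaps nothing. */
          if (nr_res_a == 0 || nr_res_b == 0) return 0.0f;

          for (int i = 0; i < nr_res_a; i++) res_union += res_a[i].size;
          for (int j = 0; j < nr_res_b; j++) res_union += res_b[j].size;

          /* Guard 2: all resources empty, nothing to overlap. */
          if (res_union == 0) return 0.0f;

          /* Pairwise intersection of the byte ranges. */
          for (int i = 0; i < nr_res_a; i++)
            for (int j = 0; j < nr_res_b; j++) {
              const struct res_desc *ra = &res_a[i], *rb = &res_b[j];
              if (ra->data <= rb->data && rb->data < ra->data + ra->size)
                res_isect += (rb->data + rb->size < ra->data + ra->size)
                                 ? rb->size
                                 : (size_t)(ra->data + ra->size - rb->data);
              else if (rb->data <= ra->data && ra->data < rb->data + rb->size)
                res_isect += (ra->data + ra->size < rb->data + rb->size)
                                 ? ra->size
                                 : (size_t)(rb->data + rb->size - ra->data);
            }

          /* Guard 3: identical resource sets would otherwise divide by zero. */
          if (res_union == res_isect) return 1.0f;

          return ((float)res_isect) / (res_union - res_isect);
        }

    With the guards in place the -t 32 run should no longer die in queue_task_overlap, though the identical resources seen in GDB still point at a separate bug.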
  • Ok - for some reason I can't debug it with gcc-5.2: it won't find the debug information for the pthread calls inside the library, though it finds everything else fine, including the pthread calls in the test_bh.c code.

    With gcc 4.4, compiling with -O0, I did confirm that

    res_union == res_isect

    And I'm pretty confident that it is due to:

      /* Case 1: rb starts inside ra. */
      if (ra->data <= rb->data && rb->data < ra->data + ra->size) {
        if (rb->data + rb->size < ra->data + ra->size)
          res_isect += rb->size;  /* rb lies entirely inside ra. */
        else
          res_isect += ra->data + ra->size - rb->data;  /* head of rb only. */
      } else if (rb->data <= ra->data && ra->data < rb->data + rb->size) {
        /* Case 2: ra starts inside rb. */
        if (ra->data + ra->size < rb->data + rb->size)
          res_isect += ra->size;  /* ra lies entirely inside rb. */
        else
          res_isect += rb->data + rb->size - ra->data;  /* head of ra only. */
      }

    Now if I check

    (gdb) print res_a[0].data <= res_b[0].data && res_b[0].data < res_a[0].data + res_a[0].size
    $12 = 1
    
    (gdb) print res_a[0].data <= res_b[1].data && res_b[1].data < res_a[0].data + res_a[0].size
    $13 = 1

    so both of res_b's 2 resources start inside res_a[0].

    The same is also true for res_a[1].

    Since

    (gdb) print res_a[0]->size
    $4 = 90720
    (gdb) print res_a[1]->size
    $5 = 90720
    (gdb) print res_b[0]->size
    $8 = 90720
    (gdb) print res_b[1]->size
    $9 = 90720

    this results in res_isect == res_union, as we add 90720 to res_isect 4 times (a toy reproduction is sketched below).
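
    To make the arithmetic concrete, a toy reproduction of this degenerate case (the buffer is made up; 90720 is the size from this GDB session):

        #include <stdio.h>
        #include <stddef.h>

        /* Both tasks reference the same 90720-byte range twice, so all
           four pairings count the full size as intersection. */
        int main(void) {
          static char buf[90720];
          char *a[2] = { buf, buf };  /* task a's two resources. */
          char *b[2] = { buf, buf };  /* task b's two resources. */
          const size_t size = sizeof buf;

          size_t res_union = 4 * size;  /* sum of all four sizes. */
          size_t res_isect = 0;
          for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
              if (a[i] <= b[j] && b[j] < a[i] + size)
                res_isect += size;  /* each pairing adds the full 90720. */

          /* Prints equal values: the unguarded denominator is zero. */
          printf("union %zu isect %zu denom %zu\n",
                 res_union, res_isect, res_union - res_isect);
          return 0;
        }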

  • And there's also the rather obvious concern of:

    (gdb) print res_a[0] == res_a[1]
    $19 = 1
    (gdb) print res_a[0] == res_b[0]
    $20 = 1
    (gdb) print res_a[1] == res_b[1]
    $22 = 1
    (gdb) print ta->type
    $26 = 1
    (gdb) print tb->type
    $27 = 1
    (gdb) print ta == tb
    $28 = 0

    Task type 1 is task_type_pair.

  • Pedro Gonnet (Author)

    So we've got the same task using the same resource twice? Possibly locking and using the same thing, as locking twice wouldn't work?

  • Yeah I think so, except I can't see that in the code...

            /* Create the task. */
            tid = qsched_addtask(s, task_type_pair, task_flag_none, data,
                                 sizeof(struct cell *) * 2, ci->count * cj->count);
    
            /* Add the resources. */
            qsched_addlock(s, tid, ci->res);
            qsched_addlock(s, tid, cj->res);
  • It's also weird that this only arises with 9+ threads.

  • Pedro Gonnet (Author)

    OK, can you verify that, after running, each task still locks/uses the resources it should? There could still be a bug somewhere in sorting the locks/uses.

  • Ok - I think it's not in the resources/uses sorting.

    (gdb) print s->res[tb->locks[0]]
    $7 = {lock = 1, hold = 0, owner = 11, parent = 32774, data = 0x7fffc8f93360, 
      size = 94512}
    (gdb) print s->res[tb->locks[1]]
    $8 = {lock = 0, hold = 0, owner = 11, parent = 32774, data = 0x7fffc8fc18c0, 
      size = 97296}
    (gdb) print s->res[ta->locks[1]]
    $9 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f659d0, 
      size = 94608}
    (gdb) print s->res[ta->locks[0]]
    $10 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f38bb0, 
      size = 93840}
    (gdb) print res_isect
    $11 = 398016
    (gdb) print res_union
    $12 = 398016
    (gdb) print res_a[0]
    $13 = (struct res *) 0x7fffc5bb6010
    (gdb) print *res_a[0]
    $14 = {lock = 1, hold = 0, owner = 2, parent = 4693, data = 0x7fffc6d16240, 
      size = 99504}

    Notably,

    (gdb) print *res_a[0]
    $14 = {lock = 1, hold = 0, owner = 2, parent = 4693, data = 0x7fffc6d16240, 
      size = 99504}

    matches neither

    (gdb) print s->res[ta->locks[1]]
    $9 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f659d0, 
      size = 94608}

    nor

    (gdb) print s->res[ta->locks[0]]
    $10 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f38bb0, 
      size = 93840}
  • Pedro Gonnet (Author)

    OK, so whatever I'm doing when computing the overlap is wrong. Thanks for tracking that down!

  • I think

      for (int k = 0; k < ta->nr_locks; k++)
        res_a[k] = &s->res[s->locks[ta->locks[k]]];

    should just be

      for (int k = 0; k < ta->nr_locks; k++)
        res_a[k] = &s->res[ta->locks[k]];

    since

    (gdb) print ta->locks - s->locks
    $22 = 43074

    and t->locks = &s->locks[ ind ]; is set earlier. That is, ta->locks already points into s->locks, so ta->locks[k] is itself a resource index and must not be passed through s->locks a second time (a minimal reduction is sketched below).
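
    A minimal, self-contained reduction of that double-indexing (the field names follow the thread; the array sizes and values are made up):

        #include <stdio.h>

        /* s->locks stores resource indices; each task's locks pointer
           aliases into that array rather than holding its own copy. */
        struct res { int id; };
        struct sched {
          struct res res[8];
          int locks[8];  /* resource indices, one entry per lock. */
        };

        int main(void) {
          struct sched s = { .locks = { 5, 2, 7, 1 } };
          for (int k = 0; k < 8; k++) s.res[k].id = k;

          int *ta_locks = &s.locks[2];  /* t->locks = &s->locks[ind]; */

          /* Wrong: ta_locks[0] is already a resource index (7), so
             s.locks[ta_locks[0]] reads an unrelated entry.          */
          struct res *wrong = &s.res[s.locks[ta_locks[0]]];
          /* Right: use the resource index directly. */
          struct res *right = &s.res[ta_locks[0]];

          printf("wrong -> res %d, right -> res %d\n", wrong->id, right->id);
          return 0;
        }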

  • Aidan Chalk added 1 commit:

    • be009de7 - Potential fix to the queue_task_overlap function
  • Pedro Gonnet (Author)

    Yup, that's probably correct.

  • I pushed a fix for the bug - can you test it on the QR example and see if we still have the improvement?
