
Resource reuse

Pedro Gonnet requested to merge resource_reuse into master (Open)

Actually also contains corrections to the paper, since I'm still too dumb to use git correctly...

Activity

  • Pedro Gonnet (Author)

    Seems to work (tested on my laptop only), but I'm not sure how much of a performance boost this gives. Will test on 64cores and cosma-f as soon as I can!

  • Pedro Gonnet added 1 commit.

  • Pedro Gonnet (Author)

    OK, tested on 64 cores, this solves all our problems. Aidan, can you have a look, just to be sure, and merge?

  • Yeah, I'll test it today/Monday and merge it.

  • I see we added a data pointer and size to qsched_addres / struct res (sketched below). Are these actually used anywhere, or are they just to make the representation clearer (and less distinct from the MPI version)?

    On the GTX690, the BH test (1M uniformly randomly distributed particles) with this version takes 1828.205 ms with 4 threads, whilst the master branch takes 1797.659 ms. With 2 threads the new version takes 3594.454 ms whilst the master branch takes 3590.812 ms. I'll test it on 64 cores after the meeting.
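
    For reference, a rough sketch of struct res with the new fields, as they show up in the GDB dumps later in this thread - the field types here are guesses from those dumps, not the library's actual header:

        #include <stddef.h>

        struct res {
          int lock;     /* non-zero while a task holds the resource.  */
          int hold;     /* hold count, per the dumps.                 */
          int owner;    /* owner id, per the dumps.                   */
          int parent;   /* index of the parent resource.              */
          void *data;   /* new field: start of the resource's memory. */
          size_t size;  /* new field: extent of that memory in bytes. */
        };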

  • Aidan Chalk added 1 commit.

  • Ok - I'm not totally convinced by this, though I don't think the error is due to these changes.

    If I run

        ./test_bh -n 1000000 -t 2

    on 64cores, the code executes successfully with no issues. However, if I run

        ./test_bh -n 1000000 -t 32

    it crashes with a Floating point exception. I'm not sure why this would occur only with more threads...

  • I just checked in GDB:

    Program received signal SIGFPE, Arithmetic exception.
    [Switching to Thread 0x7fffcb9a6700 (LWP 8281)]
    0x000000000040d1b2 in queue_task_overlap ()
    (gdb) where
    #0  0x000000000040d1b2 in queue_task_overlap ()
    #1  0x000000000040d7fd in queue_get ()
    #2  0x000000000040a5d2 in qsched_gettask ()
    #3  0x000000000040ab45 in qsched_pthread_run ()
    #4  0x0000003829e079d1 in start_thread (arg=0x7fffcb9a6700)
        at pthread_create.c:301
    #5  0x0000003829ae89dd in clone ()
        at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

    Looking at it, if

    res_union == res_isect 

    then

    return ((float)res_isect) / (res_union - res_isect);

    will divide by 0 and crash.

    I think we want the task with the most overlap with the prior task, so I think

    if (nr_res_a == 0 || nr_res_b == 0) return nr_res_a == nr_res_b ? 1.0f : 0.0f;

    should be

    if (nr_res_a == 0 || nr_res_b == 0) return 0.0f;

    and

    if (res_union == 0) return 1.0f;

    should be

    if (res_union == 0) return 0.0f;

    Finally, before

    return ((float)res_isect) / (res_union - res_isect);

    we should check (all three guards are pulled together in the sketch below)

    if (res_union == res_isect) return 1.0f;
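
    Pulling those three guards together, a minimal sketch of the guarded overlap computation (the real function is queue_task_overlap; the reduced struct, signature and loops here are illustrative assumptions, not the library's actual code):

        #include <stddef.h>

        /* Each resource is treated as a contiguous [data, data + size)
           byte range; res_union is the sum of all sizes, so the true
           union of the ranges is res_union - res_isect. */
        struct res_desc {
          char *data;
          size_t size;
        };

        static float task_overlap(const struct res_desc *res_a, int nr_res_a,
                                  const struct res_desc *res_b, int nr_res_b) {
          size_t res_union = 0, res_isect = 0;

          /* Guard 1: a task without resources overlaps nothing. */
          if (nr_res_a == 0 || nr_res_b == 0) return 0.0f;

          for (int i = 0; i < nr_res_a; i++) res_union += res_a[i].size;
          for (int j = 0; j < nr_res_b; j++) res_union += res_b[j].size;

          /* Guard 2: all resources empty, nothing to overlap. */
          if (res_union == 0) return 0.0f;

          /* Pairwise intersection of the byte ranges. */
          for (int i = 0; i < nr_res_a; i++)
            for (int j = 0; j < nr_res_b; j++) {
              const struct res_desc *ra = &res_a[i], *rb = &res_b[j];
              if (ra->data <= rb->data && rb->data < ra->data + ra->size)
                res_isect += (rb->data + rb->size < ra->data + ra->size)
                                 ? rb->size
                                 : (size_t)(ra->data + ra->size - rb->data);
              else if (rb->data <= ra->data && ra->data < rb->data + rb->size)
                res_isect += (ra->data + ra->size < rb->data + rb->size)
                                 ? ra->size
                                 : (size_t)(rb->data + rb->size - ra->data);
            }

          /* Guard 3: identical resource sets would otherwise divide by zero. */
          if (res_union == res_isect) return 1.0f;

          return ((float)res_isect) / (res_union - res_isect);
        }

    With the guards in place the -t 32 run should no longer die in queue_task_overlap, though the identical resources seen in GDB still point at a separate bug.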
  • Ok - for some reason I can't debug it with gcc-5.2: it won't find the debug information for the pthread calls inside the library, though it finds everything else fine, including the pthread calls in the test_bh.c code.

    With gcc 4.4, compiling with -O0, I did confirm that

    res_union == res_isect

    And I'm pretty confident that it is due to:

      /* Case 1: rb starts inside ra. */
      if (ra->data <= rb->data && rb->data < ra->data + ra->size) {
        if (rb->data + rb->size < ra->data + ra->size)
          res_isect += rb->size;  /* rb lies entirely inside ra. */
        else
          res_isect += ra->data + ra->size - rb->data;  /* head of rb only. */
      } else if (rb->data <= ra->data && ra->data < rb->data + rb->size) {
        /* Case 2: ra starts inside rb. */
        if (ra->data + ra->size < rb->data + rb->size)
          res_isect += ra->size;  /* ra lies entirely inside rb. */
        else
          res_isect += rb->data + rb->size - ra->data;  /* head of ra only. */
      }

    Now if I check

    (gdb) print res_a[0].data <= res_b[0].data && res_b[0].data < res_a[0].data + res_a[0].size
    $12 = 1
    
    (gdb) print res_a[0].data <= res_b[1].data && res_b[1].data < res_a[0].data + res_a[0].size
    $13 = 1

    so both of res_b's 2 resources start inside res_a[0].

    The same is also true for res_a[1].

    Since

    (gdb) print res_a[0]->size
    $4 = 90720
    (gdb) print res_a[1]->size
    $5 = 90720
    (gdb) print res_b[0]->size
    $8 = 90720
    (gdb) print res_b[1]->size
    $9 = 90720

    this results in res_isect == res_union, as we add 90720 to res_isect 4 times (a toy reproduction is sketched below).
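
    To make the arithmetic concrete, a toy reproduction of this degenerate case (the buffer is made up; 90720 is the size from this GDB session):

        #include <stdio.h>
        #include <stddef.h>

        /* Both tasks reference the same 90720-byte range twice, so all
           four pairings count the full size as intersection. */
        int main(void) {
          static char buf[90720];
          char *a[2] = { buf, buf };  /* task a's two resources. */
          char *b[2] = { buf, buf };  /* task b's two resources. */
          const size_t size = sizeof buf;

          size_t res_union = 4 * size;  /* sum of all four sizes. */
          size_t res_isect = 0;
          for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
              if (a[i] <= b[j] && b[j] < a[i] + size)
                res_isect += size;  /* each pairing adds the full 90720. */

          /* Prints equal values: the unguarded denominator is zero. */
          printf("union %zu isect %zu denom %zu\n",
                 res_union, res_isect, res_union - res_isect);
          return 0;
        }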

  • And there's also the rather obvious concern of:

    (gdb) print res_a[0] == res_a[1]
    $19 = 1
    (gdb) print res_a[0] == res_b[0]
    $20 = 1
    (gdb) print res_a[1] == res_b[1]
    $22 = 1
    (gdb) print ta->type
    $26 = 1
    (gdb) print tb->type
    $27 = 1
    (gdb) print ta == tb
    $28 = 0

    Task type 1 is task_type_pair.

  • Pedro Gonnet (Author)

    So we've got the same task using the same resource twice? Possibly locking and using the same thing, as locking twice wouldn't work?

  • Yeah I think so, except I can't see that in the code...

            /* Create the task. */
            tid = qsched_addtask(s, task_type_pair, task_flag_none, data,
                                 sizeof(struct cell *) * 2, ci->count * cj->count);
    
            /* Add the resources. */
            qsched_addlock(s, tid, ci->res);
            qsched_addlock(s, tid, cj->res);
  • It's also weird that this only arises with 9+ threads.

  • Pedro Gonnet (Author)

    OK, can you verify that, after running, each task still locks/uses the resources it should? There could still be a bug somewhere in sorting the locks/uses.

  • Ok - I think it's not in the resources/uses sorting.

    (gdb) print s->res[tb->locks[0]]
    $7 = {lock = 1, hold = 0, owner = 11, parent = 32774, data = 0x7fffc8f93360, 
      size = 94512}
    (gdb) print s->res[tb->locks[1]]
    $8 = {lock = 0, hold = 0, owner = 11, parent = 32774, data = 0x7fffc8fc18c0, 
      size = 97296}
    (gdb) print s->res[ta->locks[1]]
    $9 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f659d0, 
      size = 94608}
    (gdb) print s->res[ta->locks[0]]
    $10 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f38bb0, 
      size = 93840}
    (gdb) print res_isect
    $11 = 398016
    (gdb) print res_union
    $12 = 398016
    (gdb) print res_a[0]
    $13 = (struct res *) 0x7fffc5bb6010
    (gdb) print *res_a[0]
    $14 = {lock = 1, hold = 0, owner = 2, parent = 4693, data = 0x7fffc6d16240, 
      size = 99504}

    Notably,

    (gdb) print *res_a[0]
    $14 = {lock = 1, hold = 0, owner = 2, parent = 4693, data = 0x7fffc6d16240, 
      size = 99504}

    matches neither

    (gdb) print s->res[ta->locks[1]]
    $9 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f659d0, 
      size = 94608}

    nor

    (gdb) print s->res[ta->locks[0]]
    $10 = {lock = 0, hold = 0, owner = 11, parent = 32773, data = 0x7fffc8f38bb0, 
      size = 93840}
  • Pedro Gonnet (Author)

    OK, so whatever I'm doing when computing the overlap is wrong. Thanks for tracking that down!

  • I think

      for (int k = 0; k < ta->nr_locks; k++)
        res_a[k] = &s->res[s->locks[ta->locks[k]]];

    should just be

      for (int k = 0; k < ta->nr_locks; k++)
        res_a[k] = &s->res[ta->locks[k]];

    since

    (gdb) print ta->locks - s->locks
    $22 = 43074

    and t->locks = &s->locks[ ind ]; is set earlier. That is, ta->locks already points into s->locks, so ta->locks[k] is itself a resource index and must not be passed through s->locks a second time (a minimal reduction is sketched below).
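
    A minimal, self-contained reduction of that double-indexing (the field names follow the thread; the array sizes and values are made up):

        #include <stdio.h>

        /* s->locks stores resource indices; each task's locks pointer
           aliases into that array rather than holding its own copy. */
        struct res { int id; };
        struct sched {
          struct res res[8];
          int locks[8];  /* resource indices, one entry per lock. */
        };

        int main(void) {
          struct sched s = { .locks = { 5, 2, 7, 1 } };
          for (int k = 0; k < 8; k++) s.res[k].id = k;

          int *ta_locks = &s.locks[2];  /* t->locks = &s->locks[ind]; */

          /* Wrong: ta_locks[0] is already a resource index (7), so
             s.locks[ta_locks[0]] reads an unrelated entry.          */
          struct res *wrong = &s.res[s.locks[ta_locks[0]]];
          /* Right: use the resource index directly. */
          struct res *right = &s.res[ta_locks[0]];

          printf("wrong -> res %d, right -> res %d\n", wrong->id, right->id);
          return 0;
        }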

  • Aidan Chalk added 1 commit:

    • be009de7 - Potential fix to the queue_task_overlap function
  • Pedro Gonnet (Author)

    Yup, that's probably correct.

  • I pushed a fix for the bug - can you test it on the QR example and see if we still have the improvement?
