
Do the cell unpacking in parallel by having one thread in the threadpool per request/proxy

Closed. Matthieu Schaller requested to merge parallel_exchange_cells into master

First implementation of #684.


Activity

  • Pedro Gonnet: I think Matthieu is correct and the "any" waits are not thread-safe, so we shouldn't use them. A simple "wait" is OK, but I thought we would chunk over the requests and have a thread-local loop that polls over the local requests using MPI_Test, much like the engine does, just thread-local.

  • Do you mean having the "master" thread do the MPI_Test polling and unlocking threads to work on individual proxies as they arrive? That was my original idea, but it would require some threading infrastructure that we do not have at hand at the moment. We would need some sort of hybrid between the task and threadpool systems, I think.

  • added 1 commit

    • e2e2f36a - Use a ptrdiff_t instead of an int

  • No, in this MR you would replace the call to MPI_Wait with MPI_Test, but we'd need to loop while not all the requests in the chunk had completed.

    So each thread has a list of proxy requests to complete, and these lists are disjoint across the threads (see the sketch below). This could be unbalanced if all the early requests landed on one thread, but going further would require a lot more work.
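
    A minimal sketch of that strategy, using the proxy_unpack_mapper name that appears later
    in the commit log but otherwise hypothetical names and a generic (map_data, num_elements,
    extra_data) threadpool-style signature; SWIFT's actual proxy structures and call sites are
    not reproduced here. Each thread gets a disjoint chunk of receive requests and sweeps over
    it with MPI_Test until all of them have completed:

      #include <mpi.h>

      /* Hypothetical threadpool mapper: map_data points to this thread's
       * disjoint chunk of receive requests, one per proxy. */
      static void proxy_unpack_mapper(void *map_data, int num_elements,
                                      void *extra_data) {
        (void)extra_data; /* Unused in this sketch. */
        MPI_Request *reqs = (MPI_Request *)map_data;
        int remaining = num_elements;

        /* Keep sweeping over the chunk until every request has completed.
         * Calling MPI_Test from several threads at once requires MPI to be
         * initialised with MPI_THREAD_MULTIPLE. */
        while (remaining > 0) {
          for (int i = 0; i < num_elements; ++i) {
            if (reqs[i] == MPI_REQUEST_NULL) continue; /* Already completed. */
            int flag = 0;
            MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
            if (flag) {
              /* The cells for proxy i have arrived: unpack them here. */
              remaining--;
            }
          }
        }
      }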

  • Matthieu Schaller added 11 commits

  • Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n).

  • That sounds better indeed. I'll give this a go.

    I have tested this branch as-is on the EAGLE-50 low-z using 16 ranks with 14 threads each on 8 nodes.

    • master: 1.51s on average for engine_exchange_cells()
    • branch: 1.18s on average for engine_exchange_cells()

    First of all, it does not hang; that's a win these days. It also shows some improvement. I can check test cases where even more time is spent in this function, but before that I should add timers to see whether we spend the time waiting for MPI or unpacking.

  • Matthieu Schaller resolved all discussions

  • Also, the same can be done when receiving the cell tags.

  • Re: "Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n)."

    Setting a chunk size in the threadpool_mapper call actually only sets an upper bound on the effective chunk size, which is chosen as a function of the number of threads and the amount of data left to work on.

    Additionally, setting a fixed stride to m/n with m=10 and n=4 would either cause chunks like [2, 2, 2, 2] + [2] or, if rounded up, [3, 3, 3, 1], both of which are sub-optimal.
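
    For concreteness, a small stand-alone illustration of the two fixed-stride splits mentioned
    above (m = 10 requests over n = 4 threads); this is only the rounding argument, not SWIFT's
    actual threadpool chunking code:

      #include <stdio.h>

      /* Print the chunk sizes obtained when m work items are cut into pieces
       * of a fixed stride. */
      static void print_chunks(int m, int stride) {
        for (int start = 0; start < m; start += stride) {
          const int size = (m - start < stride) ? (m - start) : stride;
          printf("%d ", size);
        }
        printf("\n");
      }

      int main(void) {
        const int m = 10, n = 4;
        print_chunks(m, m / n);           /* stride = 2: 2 2 2 2 2, i.e. a 5th chunk. */
        print_chunks(m, (m + n - 1) / n); /* stride = 3: 3 3 3 1, an unbalanced tail. */
        return 0;
      }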

  • Just as well one of us remembers this stuff! So we do need !1101 (merged) for MPI_Test-based solutions.

  • Matthieu Schaller added 19 commits

    • 927f8d4c...adc69183 - 17 commits from branch master
    • d77e00dd - Merge branch 'master' into parallel_exchange_cells
    • dbc3fc1a - Change the strategy in the proxy_unpack_mapper to test a request and then move…

  • Here is now a version that uses MPI_Test() and moves on if unsuccessful.

  • added 1 commit

    • bc54ad67 - Do the exact same thing for the unpacking of cell tags

  • And now also for the cell tags.

  • This is embarrassing, actually. The implementation is inefficient and leaks memory...

  • added 2 commits

    • 9c1e51bf - Use a local list that reduces in size instead of looping over and over again on…
    • bd421e6b - For the tag unpacking use the auto chunk size as we have many more elements to…

  • Now, that's more respectable.
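
    A rough sketch of the refinement from the last two commits, again with hypothetical names:
    instead of sweeping over the whole chunk repeatedly, each thread keeps a local list of the
    indices that are still pending and compacts it as requests complete, so finished requests
    are never re-tested:

      #include <mpi.h>

      /* Hypothetical mapper body: poll a chunk of receive requests via a
       * local, shrinking list of still-pending indices. */
      static void unpack_pending_requests(MPI_Request *reqs, int num_requests) {
        int pending[num_requests]; /* VLA; a malloc'd buffer works as well. */
        int num_pending = num_requests;
        for (int i = 0; i < num_requests; ++i) pending[i] = i;

        while (num_pending > 0) {
          int kept = 0;
          for (int k = 0; k < num_pending; ++k) {
            const int i = pending[k];
            int flag = 0;
            MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
            if (flag) {
              /* Message i has arrived: unpack the corresponding cells or tags here. */
            } else {
              pending[kept++] = i; /* Still pending: keep it in the compacted list. */
            }
          }
          num_pending = kept;
        }
      }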
