Do the cell unpacking in parallel by having one thread in the threadpool per request/proxy
First implementation of #684.
- Resolved by Matthieu Schaller
- Resolved by Matthieu Schaller
Do you mean having the "master" thread do the MPI_Test polling and unlock threads to work on individual proxies as they arrive? That was my original idea, but it would require some threading infrastructure that we do not have at hand at the moment. We would need some sort of hybrid between the task and threadpool systems, I think.
No, in this MR you would replace the call to MPI_Wait with MPI_Test, but we'd need to loop while not all the requests in this chunk had completed. So each thread has a list of proxy requests to complete, and these are disjoint across the threads. It could be unbalanced if all the early requests landed on one thread, but going further would require a lot more work.
added 11 commits

- e2e2f36a...6b9a0345 - 10 commits from branch master
- 927f8d4c - Merge branch 'master' into parallel_exchange_cells
Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n).

That sounds better indeed. I'll give this a go.
I have tested this branch as-is on the EAGLE-50 low-z using 16 ranks with 14 threads each on 8 nodes.
- master: 1.51s on average for engine_exchange_cells()
- branch: 1.18s on average for engine_exchange_cells()
Firstly, it does not hang; that's a win these days. It also shows somewhat of an improvement. I can check test cases where even more time is spent in the function. But before that I should add timers to see whether we spend time waiting for MPI or unpacking.
> Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n).
Setting a chunk size in the threadpool_mapper call actually only sets an upper bound on the effective chunk size, which is chosen as a function of the number of threads and the amount of data left to work on. Additionally, setting a fixed stride of m/n with m=10 and n=4 would cause either chunks like [2, 2, 2, 2] + [2] or, if rounded up, [3, 3, 3, 1], both of which are sub-optimal.

Just as well one of us remembers this stuff! So we do need !1101 (merged) for MPI_Test-based solutions.

added 19 commits
- 927f8d4c...adc69183 - 17 commits from branch master
- d77e00dd - Merge branch 'master' into parallel_exchange_cells
- dbc3fc1a - Change the strategy in the proxy_unpack_mapper to test a request and then move…
added 1 commit
- bc54ad67 - Do the exact same thing for the unpacking of cell tags