Do the cell unpacking in parallel by having one thread in the threadpool per request/proxy
First implementation of #684.
- Resolved by Matthieu Schaller
- Resolved by Matthieu Schaller
Do you mean having the "master" thread do the MPI_Test polling and unlock threads to work on individual proxies as they arrive? That was my original idea, but it would require some threading infrastructure that we do not have at hand at the moment. We would need some sort of hybrid between the task and threadpool systems, I think.
No, in this MR you would replace the call to MPI_Wait with MPI_Test, but we'd need to loop while not all the requests in this chunk had completed. So each thread has a list of proxy requests to complete, and these are disjoint across the threads. It could be unbalanced if all the early requests landed on one thread, but going further would require a lot more work.
added 11 commits

- e2e2f36a...6b9a0345 - 10 commits from branch master
- 927f8d4c - Merge branch 'master' into parallel_exchange_cells
Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n).

That sounds better indeed. I'll give this a go.
I have tested this branch as-is on the EAGLE-50 low-z using 16 ranks with 14 threads each on 8 nodes.
- master: 1.51s on average for engine_exchange_cells()
- branch: 1.18s on average for engine_exchange_cells()
Firstly, it does not hang; that's a win these days. It also shows somewhat of an improvement. I can check test cases where even more time is spent in the function. But before that I should add timers to see whether we spend time waiting for MPI or unpacking.
> Obviously I'm assuming !1101 (merged) is used (although I imagined just setting the stride to m/n).
Setting a chunk size in the threadpool_mapper call actually only sets an upper bound on the effective chunk size, which is chosen as a function of the number of threads and the amount of data left to work on. Additionally, setting a fixed stride of m/n with m=10 and n=4 would cause either chunks like [2, 2, 2, 2] + [2] or, if rounded up, [3, 3, 3, 1], both of which are sub-optimal.

Just as well one of us remembers this stuff! So we do need !1101 (merged) for MPI_Test-based solutions.

added 19 commits
- 927f8d4c...adc69183 - 17 commits from branch master
- d77e00dd - Merge branch 'master' into parallel_exchange_cells
- dbc3fc1a - Change the strategy in the proxy_unpack_mapper to test a request and then move…
added 1 commit
- bc54ad67 - Do the exact same thing for the unpacking of cell tags