Improvements to proxy_cells_exchange()?

This is reviving the old topic of the slow proxy cell exchange.

The main cost in this function comes from the last loop, around line 465 of proxy.c. We previously tried a partially parallel version of that loop, with MPI calls made from inside a thread-parallel region, but it failed because it violated the MPI standard. Here are some other ideas:

- Every time MPI_Waitany() returns the index of a comm that has completed, we fire off a separate thread to do the actual unpacking of that proxy. That way we don't have to wait for the unpack to finish before starting on the next proxy.
- We replace MPI_Waitany() with MPI_Waitall() and, once everything has arrived, use the threadpool for the actual unpack.

This may not be enough, as we could have only a few proxies but many cells per proxy. In that case we should probably run the threadpool over the cells rather than over the proxies. That would require transmitting the per-cell counts somehow, so that the unpack no longer walks the buffer cumulatively.
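
To make the per-cell idea concrete, here is a minimal sketch, not the actual proxy.c code. It assumes the per-cell counts are available as a plain array and that the receive buffer is a flat array of doubles; the names `counts_to_offsets` and `unpack_cell` are hypothetical. A single serial pass turns the counts into start offsets, after which each cell's unpack depends only on its own offset and count and could be handed to the threadpool in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn per-cell element counts into start offsets
 * (an exclusive prefix sum). Returns the total number of elements. */
size_t counts_to_offsets(const int *counts, size_t *offsets, size_t ncells) {
  size_t total = 0;
  for (size_t i = 0; i < ncells; i++) {
    assert(counts[i] >= 0);
    offsets[i] = total; /* Where cell i starts in the receive buffer. */
    total += (size_t)counts[i];
  }
  return total;
}

/* Hypothetical per-cell unpack: copies one cell's slice out of the receive
 * buffer. It reads only its own offset and count, so calls for different
 * cells are independent of each other. */
void unpack_cell(const double *recv_buf, size_t offset, int count,
                 double *cell_data) {
  for (int k = 0; k < count; k++) cell_data[k] = recv_buf[offset + k];
}
```

With counts of {2, 0, 3}, the offsets come out as {0, 2, 2} and the total as 5, so the third cell can be unpacked from position 2 without touching the first two. The prefix sum itself stays serial, but it is cheap compared to the unpack.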

@pdraper @nnrw56 any thoughts before I start hacking too much?
