Draft: Ragged one-sided registered buffers.
An attempt at a technique that uses the local memory buffers directly instead of one large receive buffer per subtype. If this works (I am worried it will run out of registered memory handles), we can share the particle buffers with MPI and use DMA to avoid copying.
Works with Intel MPI 2022, but is not as fast as point-to-point. Fails with Open MPI, which has a hard limit on the number of regions that can be attached to a window (OMPI_OSC_UCX_ATTACH_MAX, I think).
Back to the drawing board.
Edited by Peter W. Draper