GPU + MPI
We need to work out a strategy for combining GPU offload with MPI within the current infrastructure.
The current idea is roughly:
- Keep using the current MPI tasks on the host.
- Have a per-cell variable that lets the device know whether a non-local cell's data has been received yet - the GPU checks this when it picks up a load task and places the task at the back of the queue if the data has not arrived (see the first sketch after this list).
- Have an additional wait value on the host's mid-density MPI send tasks, decremented on the device once the data for that task has been unloaded from the GPU after density. When the wait reaches 0 we have to find a way to put these tasks into the host's queue (not worked out yet). This also has the issue that with only one CPU thread it will currently make no progress (see the second sketch after this list).
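A minimal device-side sketch of the per-cell "received" flag, assuming a device-visible task queue; the structs, `gpu_queue_push_back` and `load_cell_data` are hypothetical placeholders, not existing code:

```c
struct cell_gpu {
  volatile int data_received; /* host flips this to 1 once the MPI recv
                                 for a foreign cell has completed */
  /* ... particle arrays etc. ... */
};

struct task_gpu {
  int cell_index;
  int is_foreign;
};

struct gpu_queue; /* opaque: some device-visible task queue */

/* Placeholder queue/load helpers, assumed to exist elsewhere. */
__device__ void gpu_queue_push_back(struct gpu_queue *q, struct task_gpu *t);
__device__ void load_cell_data(struct cell_gpu *c, struct task_gpu *t);

/* Called when a GPU worker pops a load task from the queue. */
__device__ void run_load_task(struct gpu_queue *q, struct task_gpu *t,
                              struct cell_gpu *cells) {
  struct cell_gpu *c = &cells[t->cell_index];

  /* Foreign cell whose data has not arrived: push the task to the
     back of the queue and let the worker try something else. */
  if (t->is_foreign && !c->data_received) {
    gpu_queue_push_back(q, t);
    return;
  }

  load_cell_data(c, t); /* data present: do the real load */
}
```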
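And a sketch of the wait-counter idea, assuming the counters sit in mapped (zero-copy) host memory so the device can decrement them with a system-scope atomic (`atomicSub_system` needs compute capability 6.0+); all names are again placeholders:

```c
struct send_task {
  volatile int wait; /* GPU unloads still outstanding for this send */
  /* ... MPI buffer, destination, request, ... */
};

/* Device side: called once a cell's density results have been
   unloaded from the GPU; drops the wait count of its send task. */
__device__ void signal_unload_done(struct send_task *t) {
  atomicSub_system((int *)&t->wait, 1);
}

void enqueue_on_host(struct send_task *t); /* placeholder */

/* Host side: something has to notice wait == 0 and enqueue the send.
   Polling from the kernel-launching thread is the obvious option, but
   note the problem above: with a single CPU thread this loop is the
   only thread, so nothing else progresses while it spins. */
void poll_send_tasks(struct send_task *tasks, int n) {
  for (int i = 0; i < n; i++) {
    if (tasks[i].wait == 0) {
      tasks[i].wait = -1; /* mark claimed so we enqueue only once */
      enqueue_on_host(&tasks[i]);
    }
  }
}
```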
This feels a bit messy. A potential alternative (still messy) is:
- Ignore the current MPI tasks on the host for density (or for gravity, if we change what runs on the GPU).
- Have a list of "ready to send" cells populated by the GPU and drained by the thread that launched the GPU kernel, plus a list of "received" cells populated by the CPU and checked by the GPU as above. That one thread does all of the communication required for the GPU (a host-side sketch follows this list).
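A host-side sketch of that communication loop; the `comm_state` accessors stand in for the two shared lists/flags and are purely illustrative:

```c
#include <mpi.h>

struct comm_state; /* opaque: buffers, requests, cell metadata */

/* Placeholders for the shared lists/flags described above. */
int density_done(struct comm_state *s);
int pop_ready_cell(struct comm_state *s, int *cid); /* GPU-filled list */
void mark_cell_received(struct comm_state *s, int idx); /* GPU-polled flag */
void post_isend_for_cell(struct comm_state *s, int cid);

void comm_loop(struct comm_state *s, int nrecv, MPI_Request *recv_reqs) {
  while (!density_done(s)) {
    /* Send side: the GPU appends cells here once their density data
       has been unloaded; post the matching MPI_Isend straight away. */
    int cid;
    while (pop_ready_cell(s, &cid)) post_isend_for_cell(s, cid);

    /* Receive side: a completed receive makes a foreign cell's data
       visible, which un-blocks the corresponding GPU load task. */
    int idx, flag;
    MPI_Testany(nrecv, recv_reqs, &idx, &flag, MPI_STATUS_IGNORE);
    if (flag && idx != MPI_UNDEFINED) mark_cell_received(s, idx);
  }
}
```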
I think the best (easiest) way to do multi-GPU + CPU will be 1 rank per GPU plus 1 rank for any unused CPUs on the node - driving multiple GPUs from a single rank is probably overly complex with our model.
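For reference, the usual way to bind 1 rank per GPU is via the node-local rank; this uses only standard MPI/CUDA calls, the scheme itself being the assumption:

```c
#include <mpi.h>
#include <cuda_runtime.h>

void bind_rank_to_gpu(void) {
  MPI_Comm node_comm;
  int local_rank, ngpus;

  /* Ranks sharing a node end up in the same communicator. */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);

  cudaGetDeviceCount(&ngpus);
  /* The first ngpus local ranks each take one GPU; any extra rank
     stays CPU-only and mops up the remaining cores. */
  if (local_rank < ngpus) cudaSetDevice(local_rank);

  MPI_Comm_free(&node_comm);
}
```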
Any other ideas for this would be appreciated. Using GPUDirect (CUDA-aware MPI) for the data transfer, rather than manually copying back to the CPU, is also possible and may work better when using 1 rank per GPU.
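A sketch of what the GPUDirect variant would look like: with a CUDA-aware MPI build, a device pointer can be handed straight to MPI_Isend, skipping the manual device-to-host copy. Whether this actually wins depends on the MPI stack and interconnect, so this is an option to benchmark rather than a recommendation:

```c
#include <mpi.h>
#include <cuda_runtime.h>

void send_cell_direct(void *d_buf, int count, int dest, int tag,
                      cudaStream_t stream, MPI_Request *req) {
  /* Make sure the kernel producing d_buf has finished before MPI
     touches it; a stream sync is the simplest correct choice. */
  cudaStreamSynchronize(stream);

  /* d_buf is a device pointer: only valid with a CUDA-aware MPI. */
  MPI_Isend(d_buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD, req);
}
```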