GPU + MPI
We need to work out a strategy for combining GPU offload with MPI within the current infrastructure.
The current idea is roughly:
- Keep using the current MPI tasks on the host.
- Have a per-cell variable that lets the device know whether a non-local cell's data has been received yet - the GPU checks this when it picks up a load task and places the task at the back of the queue if the data has not arrived (see the first sketch after this list).
- Have an additional wait value on the host's mid-density MPI send tasks, decremented on the device once the data for that task has been unloaded from the GPU after density. When the wait reaches 0 we have to find a way to put these tasks into the host's queue (not worked out yet). This also has the issue that with only one CPU thread it will currently make no progress (see the second sketch after this list).
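A minimal device-side sketch of the per-cell "received" flag, assuming a device-visible task queue; the structs, `gpu_queue_push_back` and `load_cell_data` are hypothetical placeholders, not existing code:

```c
struct cell_gpu {
  volatile int data_received; /* host flips this to 1 once the MPI recv
                                 for a foreign cell has completed */
  /* ... particle arrays etc. ... */
};

struct task_gpu {
  int cell_index;
  int is_foreign;
};

struct gpu_queue; /* opaque: some device-visible task queue */

/* Placeholder queue/load helpers, assumed to exist elsewhere. */
__device__ void gpu_queue_push_back(struct gpu_queue *q, struct task_gpu *t);
__device__ void load_cell_data(struct cell_gpu *c, struct task_gpu *t);

/* Called when a GPU worker pops a load task from the queue. */
__device__ void run_load_task(struct gpu_queue *q, struct task_gpu *t,
                              struct cell_gpu *cells) {
  struct cell_gpu *c = &cells[t->cell_index];

  /* Foreign cell whose data has not arrived: push the task to the
     back of the queue and let the worker try something else. */
  if (t->is_foreign && !c->data_received) {
    gpu_queue_push_back(q, t);
    return;
  }

  load_cell_data(c, t); /* data present: do the real load */
}
```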
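And a sketch of the wait-counter idea, assuming the counters sit in mapped (zero-copy) host memory so the device can decrement them with a system-scope atomic (`atomicSub_system` needs compute capability 6.0+); all names are again placeholders:

```c
struct send_task {
  volatile int wait; /* GPU unloads still outstanding for this send */
  /* ... MPI buffer, destination, request, ... */
};

/* Device side: called once a cell's density results have been
   unloaded from the GPU; drops the wait count of its send task. */
__device__ void signal_unload_done(struct send_task *t) {
  atomicSub_system((int *)&t->wait, 1);
}

void enqueue_on_host(struct send_task *t); /* placeholder */

/* Host side: something has to notice wait == 0 and enqueue the send.
   Polling from the kernel-launching thread is the obvious option, but
   note the problem above: with a single CPU thread this loop is the
   only thread, so nothing else progresses while it spins. */
void poll_send_tasks(struct send_task *tasks, int n) {
  for (int i = 0; i < n; i++) {
    if (tasks[i].wait == 0) {
      tasks[i].wait = -1; /* mark claimed so we enqueue only once */
      enqueue_on_host(&tasks[i]);
    }
  }
}
```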
This feels a bit messy. A potential alternative (still messy) is:
- Ignore the current MPI tasks on the host for density (or for gravity, if we change what runs on the GPU).
- Have a list of "ready to send" cells populated by the GPU and drained by the thread that launched the GPU kernel, plus a list of "received" cells populated by the CPU and checked by the GPU as above. That one thread does all of the communication required for the GPU (a host-side sketch follows this list).
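A host-side sketch of that communication loop; the `comm_state` accessors stand in for the two shared lists/flags and are purely illustrative:

```c
#include <mpi.h>

struct comm_state; /* opaque: buffers, requests, cell metadata */

/* Placeholders for the shared lists/flags described above. */
int density_done(struct comm_state *s);
int pop_ready_cell(struct comm_state *s, int *cid); /* GPU-filled list */
void mark_cell_received(struct comm_state *s, int idx); /* GPU-polled flag */
void post_isend_for_cell(struct comm_state *s, int cid);

void comm_loop(struct comm_state *s, int nrecv, MPI_Request *recv_reqs) {
  while (!density_done(s)) {
    /* Send side: the GPU appends cells here once their density data
       has been unloaded; post the matching MPI_Isend straight away. */
    int cid;
    while (pop_ready_cell(s, &cid)) post_isend_for_cell(s, cid);

    /* Receive side: a completed receive makes a foreign cell's data
       visible, which un-blocks the corresponding GPU load task. */
    int idx, flag;
    MPI_Testany(nrecv, recv_reqs, &idx, &flag, MPI_STATUS_IGNORE);
    if (flag && idx != MPI_UNDEFINED) mark_cell_received(s, idx);
  }
}
```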
I think the best (easiest) way to do multi-GPU + CPU will be 1 rank per GPU plus 1 rank for any unused CPUs on the node - driving multiple GPUs from a single rank is probably overly complex with our model.
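For reference, the usual way to bind 1 rank per GPU is via the node-local rank; this uses only standard MPI/CUDA calls, the scheme itself being the assumption:

```c
#include <mpi.h>
#include <cuda_runtime.h>

void bind_rank_to_gpu(void) {
  MPI_Comm node_comm;
  int local_rank, ngpus;

  /* Ranks sharing a node end up in the same communicator. */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);

  cudaGetDeviceCount(&ngpus);
  /* The first ngpus local ranks each take one GPU; any extra rank
     stays CPU-only and mops up the remaining cores. */
  if (local_rank < ngpus) cudaSetDevice(local_rank);

  MPI_Comm_free(&node_comm);
}
```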
Any other ideas for this would be appreciated. Using GPUDirect (CUDA-aware MPI) for the data transfer, rather than manually copying back to the CPU, is also possible and may work better when using 1 rank per GPU.
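A sketch of what the GPUDirect variant would look like: with a CUDA-aware MPI build, a device pointer can be handed straight to MPI_Isend, skipping the manual device-to-host copy. Whether this actually wins depends on the MPI stack and interconnect, so this is an option to benchmark rather than a recommendation:

```c
#include <mpi.h>
#include <cuda_runtime.h>

void send_cell_direct(void *d_buf, int count, int dest, int tag,
                      cudaStream_t stream, MPI_Request *req) {
  /* Make sure the kernel producing d_buf has finished before MPI
     touches it; a stream sync is the simplest correct choice. */
  cudaStreamSynchronize(stream);

  /* d_buf is a device pointer: only valid with a CUDA-aware MPI. */
  MPI_Isend(d_buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD, req);
}
```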