Profiling/minikernel implementation of self/pair + looking into improvements
Following on from what @lhausammann and @jborrow started at the hackathon, I think this is currently one of the main things to look into w.r.t. GPU performance.
Current things to experiment with (probably in the order we should do them):
- Shared memory usage (@lhausammann started looking into this and thinks it's beneficial)
- Sorted/unsorted pair interactions (@lhausammann also started looking into this)
- Profiling of minikernels and megakernels (@jborrow started looking at this)
- Subcelling of large self or pair interactions (similar to the CPU). I believe this is a major bottleneck for the SodShock at present: we interact cells containing thousands of particles with a naive n^2 approach, which results in a significant performance loss vs the CPU.
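
To make the subcelling point concrete, here is a rough CPU-side C sketch (not the actual SWIFT or GPU code) that compares the number of distance checks for a naive n^2 self interaction against a subcelled one. Everything in it is hypothetical and chosen for illustration: the `part` struct, the function names, and the sizes `N`, `S` and `H`. It assumes a unit-width, non-periodic cell and a fixed cutoff `H` no larger than the subcell width `1/S`.

```c
#include <assert.h>
#include <stdlib.h>

#define N 2000 /* particles in one large cell (hypothetical) */
#define S 4    /* subcells per dimension (hypothetical) */
#define H 0.2f /* interaction cutoff, in units of the cell width */
#define NSUB (S * S * S)

typedef struct { float x[3]; } part; /* hypothetical minimal particle */

/* Fill the unit cell with pseudo-random positions. */
static void fill_random(part *p, unsigned seed) {
  srand(seed);
  for (int i = 0; i < N; i++)
    for (int k = 0; k < 3; k++)
      p[i].x[k] = (float)(rand() / (RAND_MAX + 1.0));
}

/* 1 if the two particles are closer than the cutoff, else 0. */
static int within(const part *a, const part *b) {
  float r2 = 0.f;
  for (int k = 0; k < 3; k++) {
    const float d = a->x[k] - b->x[k];
    r2 += d * d;
  }
  return r2 < H * H;
}

/* Naive self interaction: every pair in the cell is tested. */
static long naive(const part *p, long *checks) {
  long hits = 0;
  *checks = 0;
  for (int i = 0; i < N; i++)
    for (int j = i + 1; j < N; j++) {
      (*checks)++;
      hits += within(&p[i], &p[j]);
    }
  return hits;
}

/* Flattened index of the subcell containing a particle. */
static int sub_index(const part *p) {
  int c[3];
  for (int k = 0; k < 3; k++) {
    c[k] = (int)(p->x[k] * S);
    if (c[k] >= S) c[k] = S - 1; /* guard against x == 1.0 */
  }
  return (c[0] * S + c[1]) * S + c[2];
}

/* Subcelled self interaction: only same and adjacent subcells are tested. */
static long subcelled(const part *p, long *checks) {
  assert(1.0f / S >= H); /* a subcell must be at least one cutoff wide */
  int start[NSUB + 1] = {0}, fill[NSUB], order[N];

  /* Counting sort of particle indices by subcell. */
  for (int i = 0; i < N; i++) start[sub_index(&p[i]) + 1]++;
  for (int c = 0; c < NSUB; c++) {
    start[c + 1] += start[c];
    fill[c] = start[c];
  }
  for (int i = 0; i < N; i++) order[fill[sub_index(&p[i])]++] = i;

  long hits = 0;
  *checks = 0;
  for (int a = 0; a < NSUB; a++) {
    const int ai = a / (S * S), aj = (a / S) % S, ak = a % S;
    for (int di = -1; di <= 1; di++)
      for (int dj = -1; dj <= 1; dj++)
        for (int dk = -1; dk <= 1; dk++) {
          const int bi = ai + di, bj = aj + dj, bk = ak + dk;
          if (bi < 0 || bi >= S || bj < 0 || bj >= S || bk < 0 || bk >= S)
            continue;
          const int b = (bi * S + bj) * S + bk;
          if (b < a) continue; /* handle each subcell pair only once */
          for (int m = start[a]; m < start[a + 1]; m++) {
            /* Within one subcell, start after m to avoid double counting. */
            const int n0 = (b == a) ? m + 1 : start[b];
            for (int n = n0; n < start[b + 1]; n++) {
              (*checks)++;
              hits += within(&p[order[m]], &p[order[n]]);
            }
          }
        }
  }
  return hits;
}
```

Calling `fill_random`, then `naive` and `subcelled` on the same array should give the same number of neighbour pairs, with the subcelled version performing fewer distance checks. A real GPU version would additionally have to deal with variable smoothing lengths and with how subcells map onto thread blocks, neither of which this sketch attempts.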
My preferred route would be to have someone else experiment with the first two items in the list above and write short reports (a few pages?) detailing what works well and what doesn't. Once we have these, I can port the improvements back into the megakernel and see whether they're beneficial, while work on the last two items starts on the minikernels.
@matthieu does this sound reasonable?