We could not profile our code with the MegaKernel™ because of CUDA limitations, so our work focused on tasks.
All speedups are given relative to the naive version and come from code that is not fully optimized.
What we have tried:
What tricks we have learned:
if (i == j) continue;
What we are working on:
^1 Number quoted from memory, so it may not be exact.