Summary of Eurohack17
We were not able to profile our code using the MegaKernel™ due to CUDA limitations, so our work focused on the following tasks.
All speedups are given relative to the naive version and were measured on code that is not yet fully optimized.
What we have tried:
- Shared memory gives a speedup of about 3x (for self-density, with the shared-memory tile matching the cell size)^1.
- Exploiting symmetry gives about 1.5x.
- Sorted computation gives about 70x.
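The shared-memory result above can be illustrated with a minimal sketch. This is not the project's actual code: the `Particle` struct, the placeholder smoothing function `W`, and the tile size `CELL_SIZE` are all illustrative assumptions; the point is only the pattern of staging one cell into shared memory so neighbour reads avoid repeated global loads.

```cuda
#include <math.h>

#define CELL_SIZE 128  // assumed tile size, matching the cell size exactly

struct Particle { float x, y, z, m; };

// Placeholder smoothing kernel; the real code uses its own.
__device__ float W(float r2) {
    return expf(-r2);
}

// One block per cell; each thread handles one particle of that cell.
__global__ void self_density(const Particle* parts, float* rho, int n) {
    __shared__ Particle tile[CELL_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage the whole cell into shared memory once...
    if (i < n) tile[threadIdx.x] = parts[i];
    __syncthreads();

    if (i >= n) return;

    // ...then read all neighbours from fast shared memory instead of
    // issuing CELL_SIZE separate global-memory loads per thread.
    float acc = 0.0f;
    int cell_count = min(CELL_SIZE, n - (int)(blockIdx.x * blockDim.x));
    for (int j = 0; j < cell_count; ++j) {
        float dx = tile[threadIdx.x].x - tile[j].x;
        float dy = tile[threadIdx.x].y - tile[j].y;
        float dz = tile[threadIdx.x].z - tile[j].z;
        acc += tile[j].m * W(dx * dx + dy * dy + dz * dz);
    }
    rho[i] = acc;
}
```

The roughly 3x speedup quoted above comes from this kind of reuse: each particle in the tile is loaded from global memory once per block rather than once per thread.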
What tricks we have learned:
- Avoid leaving threads in a warp idle (e.g. in a loop, instead of `if (i == j) continue;`, increment the index manually so the thread does not sit waiting on the others).
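The manual-increment trick above can be sketched as follows. The pair interaction `interact` is a hypothetical placeholder, not a function from the project; the sketch only shows how to skip the diagonal term without a divergent branch.

```cuda
__device__ float interact(int i, int j);  // hypothetical pair interaction

__global__ void pairwise(float* acc_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;

    // Divergent version (what the trick replaces): each thread hits the
    // branch at a different j, so threads in the warp wait on each other.
    //
    //   for (int j = 0; j < n; ++j) {
    //       if (j == i) continue;
    //       acc += interact(i, j);
    //   }

    // Manual-index version: iterate n - 1 times and step past j == i
    // arithmetically, so every thread in the warp does identical work.
    for (int k = 0; k < n - 1; ++k) {
        int j = k + (k >= i);  // maps k = 0..n-2 onto j = 0..n-1, skipping i
        acc += interact(i, j);
    }
    acc_out[i] = acc;
}
```

Because `k + (k >= i)` is plain arithmetic rather than control flow, all threads in the warp execute the same instructions for the same number of iterations.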
What we are working on:
- Shared memory with a smaller size than the cell size
^1 Number quoted from memory, so it may not be exact.