Fully vectorized kernel functions
We would like the kernel functions to:
- Fully unroll.
- Fully vectorize using FMAs.
- Avoid unaligned memory accesses unless they are necessary.
- Be perfectly branch-free in the case of the 3 Wendland kernels.
We have a ticket open with Intel to understand why the compiler can unroll OR vectorize but not both.
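
As a rough illustration of these goals, here is a minimal sketch of a branch-free Wendland C2 evaluation over a fixed-size, aligned batch. It is not the project's actual implementation: the names `wendland_c2`, `wendland_c2_vec` and `VEC_SIZE` are hypothetical, the kernel is assumed to be the unnormalised shape w(q) = (1 - q)^4 (1 + 4q) with q = r/H >= 0, and the normalisation constant is left out. The cut-off at q = 1 is handled by clamping rather than by an `if`, and the loop has a compile-time trip count so that a compiler is at least allowed to unroll and vectorize it.

```c
#include <stddef.h>

/* Hypothetical batch width: 8 floats = one 256-bit vector register. */
#define VEC_SIZE 8

/* Unnormalised Wendland C2 shape, assumed here to be
 * w(q) = (1 - q)^4 * (1 + 4 q) for 0 <= q <= 1 and 0 for q > 1. */
static inline float wendland_c2(float q) {
  /* Clamp q to 1 instead of testing q > 1: the factor (1 - q)^4 is exactly
   * zero at q = 1, so the clamped value already yields w = 0. Compilers
   * typically lower this select to a min instruction, not a jump. */
  const float qc = q < 1.0f ? q : 1.0f;

  const float t = 1.0f - qc;
  const float t2 = t * t;
  const float t4 = t2 * t2;

  /* Written as a plain multiply-add so the compiler may contract it into
   * an FMA where the target has one, without forcing a libm call elsewhere. */
  return t4 * (4.0f * qc + 1.0f);
}

/* Evaluate the kernel for VEC_SIZE values of q. The fixed trip count and
 * the restrict qualifiers are there to make full unrolling and
 * vectorization legal; whether they happen is up to the compiler. */
void wendland_c2_vec(const float *restrict q, float *restrict w) {
  for (size_t i = 0; i < VEC_SIZE; ++i) {
    w[i] = wendland_c2(q[i]);
  }
}
```

A caller would pass buffers declared with suitable alignment, for example `_Alignas(32) float q[VEC_SIZE];`, so that aligned loads and stores can be used. Whether a given compiler emits them, and whether it both unrolls and vectorizes this loop, has to be checked in the generated assembly or the vectorization report.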