This is inspired by what you reported on the Optane. By reversing the way we loop, we gain around 20% in this function call on cosma7. It's not game-changing, but still worth the change.
On EAGLE-25 with one node (28 threads) and the Intel compiler with the usual flags, we go from ~510 ms per call to ~400 ms.
Happy to hear any thoughts you may have on this.