Calls to fmaxf() in critical code
This is to keep track of things. I took the scalar Gadget2 version of the SPH code from master and made the following change on lines 471 and 472:
//pi->force.v_sig = (pi->force.v_sig > v_sig) ? pi->force.v_sig : v_sig;
//pj->force.v_sig = (pj->force.v_sig > v_sig) ? pj->force.v_sig : v_sig;
pi->force.v_sig = fmaxf(pi->force.v_sig, v_sig);
pj->force.v_sig = fmaxf(pj->force.v_sig, v_sig);
That is, the ternary operator is replaced by a call to fmaxf, as was present in the code up to last week.
According to the GCC documentation (https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html), these calls should be replaced by builtin operations.
Compiling with gcc 4.8 using the options -std=gnu99 -g -O0 -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -march=corei7-avx -mavx -fno-tree-vectorize -Wall -Wextra -Wno-unused-parameter -Werror
(i.e. what comes out of the config script), I get the following assembly code:
0x410997 471 vmaxssl 0x5c(%rax), %xmm5, %xmm0
0x4109a1 471 vmovssl %xmm0, 0x5c(%rax)
0x4109a6 472 vmaxssl 0x5c(%rdx), %xmm5, %xmm5
0x4109ab 472 vmovssl %xmm5, 0x5c(%rdx)
as expected.
Now, when I compile with the Intel compiler 2016 update 3 using the options -std=gnu99 -g -no-prec-div -static -fbuiltin -fp-model fast=2 -O3 -ansi_alias -xAVX -no-vec -no-simd -pthread -w2 -Wunused-variable -Werror, I get the following assembly code:
0x427e2b 471 vmovssl 0x5c(%rbx,%r12,1), %xmm0
0x427e54 471 vmovssl %xmm1, 0x160(%rsp)
0x427e5d 471 callq 0x458c60 <fmaxf>
0x427e62 472 vmovssl 0xdc(%r14,%r15,1), %xmm2
0x427e6c 472 vmovssl 0x160(%rsp), %xmm1
0x427e75 471 vmovssl %xmm0, 0x5c(%rbx,%r12,1)
0x427e7c 472 vmovaps %xmm2, %xmm0
0x427e80 472 callq 0x458c60 <fmaxf>
0x427ec2 472 vmovssl %xmm0, 0xdc(%r14,%r15,1)
So there is clearly a function call in there, and it is hurting us. On the EAGLE_25 case with 16 cores and 4096 steps (swift -s -t 16 eagle_25.yml -a -n 4096), the wallclock time goes from 1395s with gcc to 1800s with the Intel compiler. That is a shocking slowdown of about 30%.
The version with ternary operators, built with the Intel compiler and the same flags, compiles to:
0x426661 469 vmovssl 0xdc(%r11,%r14,1), %xmm10
0x426670 469 vmaxss %xmm12, %xmm10, %xmm10
0x426675 468 vmovssl 0x5c(%rdx,%r12,1), %xmm8
0x426683 468 vmaxss %xmm12, %xmm8, %xmm8
0x4266de 468 vmovssl %xmm8, 0x5c(%rdx,%r12,1)
0x4266e5 469 vmovssl %xmm10, 0xdc(%r11,%r14,1)
i.e. with a built-in max instruction, though not the same one as in the GCC-produced assembly. That code is nevertheless as fast as the GCC version given above.
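To keep the ternary form without repeating it at every call site, it could be wrapped in a small inline helper. The following is only a sketch: max_f, update_v_sig and struct force_data are hypothetical names made up for illustration and are not taken from the code base, and I have not checked what either compiler emits for this exact helper.

```c
/* Hypothetical helper (an assumption, not part of the code discussed above):
 * writes the maximum as a comparison plus select, i.e. the same expression
 * as the ternary version that both compilers turned into a max instruction. */
static inline float max_f(float a, float b) {
  return (a > b) ? a : b;
}

/* Simplified stand-in for the particle force data; the real layout differs. */
struct force_data {
  float v_sig;
};

/* Stand-in for the update on lines 471 and 472: both particles keep the
 * larger of their current signal velocity and the new one. */
static inline void update_v_sig(struct force_data *fi, struct force_data *fj,
                                float v_sig) {
  fi->v_sig = max_f(fi->v_sig, v_sig);
  fj->v_sig = max_f(fj->v_sig, v_sig);
}
```

Note that this is not a drop-in equivalent of fmaxf in every case: if the second argument is NaN, max_f returns NaN, whereas fmaxf returns the other argument. For finite inputs the two agree.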