Calls to fmaxf() in critical code

@nnrw56 @jwillis @pdraper

This issue is here to keep track of things. I took the scalar Gadget2 version of the SPH code from master and made the following change on lines 471 and 472:

//pi->force.v_sig = (pi->force.v_sig > v_sig) ? pi->force.v_sig : v_sig;
//pj->force.v_sig = (pj->force.v_sig > v_sig) ? pj->force.v_sig : v_sig;
pi->force.v_sig = fmaxf(pi->force.v_sig,  v_sig);
pj->force.v_sig = fmaxf(pj->force.v_sig,  v_sig);

This replaces the ternary operator with a call to fmaxf, which is what the code used up to last week.

According to the GCC documentation (https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html), these calls should be replaced by builtin operations.

Compiling with gcc 4.8 using the options -std=gnu99 -g -O0 -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -march=corei7-avx -mavx -fno-tree-vectorize -Wall -Wextra -Wno-unused-parameter -Werror (i.e. what comes out of the config script), I get the following assembly code:

0x410997 471 vmaxssl  0x5c(%rax), %xmm5, %xmm0
0x4109a1 471 vmovssl  %xmm0, 0x5c(%rax)
0x4109a6 472 vmaxssl  0x5c(%rdx), %xmm5, %xmm5
0x4109ab 472 vmovssl  %xmm5, 0x5c(%rdx)

as expected.

Now, when I compile with the Intel compiler 2016 update 3 using the options -std=gnu99 -g -no-prec-div -static -fbuiltin -fp-model fast=2 -O3 -ansi_alias -xAVX -no-vec -no-simd -pthread -w2 -Wunused-variable -Werror, I get the following assembly code:

0x427e2b 471 vmovssl  0x5c(%rbx,%r12,1), %xmm0
0x427e54 471 vmovssl  %xmm1, 0x160(%rsp)
0x427e5d 471 callq  0x458c60 <fmaxf> 
0x427e62 472 vmovssl  0xdc(%r14,%r15,1), %xmm2
0x427e6c 472 vmovssl  0x160(%rsp), %xmm1
0x427e75 471 vmovssl  %xmm0, 0x5c(%rbx,%r12,1)
0x427e7c 472 vmovaps %xmm2, %xmm0
0x427e80 472 callq  0x458c60 <fmaxf>
0x427ec2 472 vmovssl  %xmm0, 0xdc(%r14,%r15,1)

So there clearly is a function call in there, and it is costing us. On the EAGLE_25 case with 16 cores and 4096 steps (swift -s -t 16 eagle_25.yml -a -n 4096), we go from 1395s of wallclock time with gcc to 1800s with the Intel compiler. That is a shocking increase of roughly 30% in run time.

The version with ternary operators with the Intel compiler and same flags compiles to:

0x426661 469 vmovssl  0xdc(%r11,%r14,1), %xmm10
0x426670 469 vmaxss %xmm12, %xmm10, %xmm10
0x426675 468 vmovssl  0x5c(%rdx,%r12,1), %xmm8
0x426683 468 vmaxss %xmm12, %xmm8, %xmm8
0x4266de 468 vmovssl  %xmm8, 0x5c(%rdx,%r12,1)
0x4266e5 469 vmovssl  %xmm10, 0xdc(%r11,%r14,1)

This uses an inlined max instruction, though not in quite the same form as in the GCC-produced assembly. That code is, however, as fast as the GCC version given above.
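One possible way to keep the fast path with both compilers would be to hide the ternary behind a small inline helper, so that the choice is made in one place. This is only a sketch under assumptions: struct force_data, max_f and update_v_sig are hypothetical names used for illustration, not what is in master, and the helper assumes v_sig is never NaN.

```c
/* Hypothetical stand-in for the force sub-structure holding v_sig. */
struct force_data {
  float v_sig;
};

/* Ternary-based max: no library call involved, so there is nothing
 * for the compiler to fail to inline. Assumes neither input is NaN. */
static inline float max_f(float a, float b) { return (a > b) ? a : b; }

/* Update the signal velocity of both particles of a pair,
 * mirroring the two lines changed above. */
static inline void update_v_sig(struct force_data *fi, struct force_data *fj,
                                float v_sig) {
  fi->v_sig = max_f(fi->v_sig, v_sig);
  fj->v_sig = max_f(fj->v_sig, v_sig);
}
```

Whether this still compiles down to a single max instruction would need to be checked in the assembly for each compiler and flag set, as done above.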
