Only use the AVX512 version of memswap when compiling with the Intel compiler, as GCC appears to generate invalid code for it.
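A minimal sketch of what such a compile-time guard could look like. The macro `USE_AVX512_MEMSWAP` and the function `memswap_sketch` are hypothetical names chosen for illustration, not the project's actual identifiers. It assumes the classic Intel compiler, which defines `__INTEL_COMPILER`.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical feature macro: enable the AVX512 path only when the target
// supports AVX512F *and* the classic Intel compiler is in use.
// Assumption: the LLVM-based Intel compiler defines __INTEL_LLVM_COMPILER
// instead and would need its own check.
#if defined(__AVX512F__) && defined(__INTEL_COMPILER)
#define USE_AVX512_MEMSWAP 1
#include <immintrin.h>
#else
#define USE_AVX512_MEMSWAP 0
#endif

// Swap n bytes between two non-overlapping buffers.
inline void memswap_sketch(void* a, void* b, std::size_t n) {
    auto* pa = static_cast<unsigned char*>(a);
    auto* pb = static_cast<unsigned char*>(b);
    std::size_t i = 0;
#if USE_AVX512_MEMSWAP
    // Vector path: swap 64 bytes per iteration with unaligned loads/stores.
    for (; i + 64 <= n; i += 64) {
        __m512i va = _mm512_loadu_si512(pa + i);
        __m512i vb = _mm512_loadu_si512(pb + i);
        _mm512_storeu_si512(pa + i, vb);
        _mm512_storeu_si512(pb + i, va);
    }
#endif
    // Scalar path: handles the tail, or the whole buffer on other compilers.
    for (; i < n; ++i) {
        std::swap(pa[i], pb[i]);
    }
}
```

With GCC the guard evaluates to 0, so only the scalar loop is compiled.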
Also allocate the offset arrays using an aligned allocation.
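A sketch of one way to allocate an offset array with a fixed alignment, using C++17 `std::aligned_alloc`. The 64-byte alignment, the `std::size_t` element type, and the names `allocate_offsets`, `OffsetArray`, and `is_aligned` are assumptions for illustration. The actual change may use a different allocator or alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Assumed alignment: 64 bytes, the width of one 512-bit vector register.
constexpr std::size_t kAlignment = 64;

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using OffsetArray = std::unique_ptr<std::size_t[], FreeDeleter>;

// Allocate `count` offsets on a kAlignment-byte boundary (uninitialized).
inline OffsetArray allocate_offsets(std::size_t count) {
    std::size_t bytes = count * sizeof(std::size_t);
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    bytes = (bytes + kAlignment - 1) / kAlignment * kAlignment;
    if (bytes == 0) {
        bytes = kAlignment;
    }
    void* p = std::aligned_alloc(kAlignment, bytes);
    if (p == nullptr) {
        throw std::bad_alloc{};
    }
    return OffsetArray(static_cast<std::size_t*>(p));
}

// Helper to check that a pointer sits on a kAlignment-byte boundary.
inline bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}
```

Wrapping the pointer in a `std::unique_ptr` with a custom deleter keeps the `std::free` call paired with the allocation.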
Fixes #428 and #430.