Looking at some of the scaling results, it turns out that the last remaining significant chunk of non-parallel code is
space_rebuild(), and most of the time in there is spent computing the cell index of each particle.
This can easily be done in parallel, and on EAGLE_25 doing so shows significant improvements in both code speed and scalability. One caveat: this comes from running on 16 cores only, and is based on the vTune outputs rather than full timing runs (although the vTune outputs usually match the actual tests).
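
To illustrate why this step parallelises so easily, here is a minimal sketch of the idea, not the actual code in this branch. The struct, the function names (`example_cell_id`, `example_compute_cell_indices`) and the use of an OpenMP pragma are all assumptions made for the example; the real implementation may use a different threading mechanism and data layout. The point is only that each particle's cell index depends on that particle's position alone, so the loop iterations are independent.

```c
#include <stddef.h>

/* Hypothetical, simplified particle: only the position matters here. */
struct example_part {
  double x[3];
};

/* Row-major index of cell (i, j, k) in a grid of
 * cdim[0] x cdim[1] x cdim[2] cells (hypothetical helper). */
static int example_cell_id(const int cdim[3], int i, int j, int k) {
  return (i * cdim[1] + j) * cdim[2] + k;
}

/* Compute the cell index of every particle.
 *
 * Each iteration reads parts[p] and writes only ind[p], so there is no
 * shared mutable state and the loop can be split across threads as-is.
 * Without OpenMP enabled the pragma is ignored and the loop runs serially.
 * Assumes positions lie in [0, dim); out-of-range values are clamped. */
void example_compute_cell_indices(const struct example_part *parts,
                                  size_t nr_parts, const double dim[3],
                                  const int cdim[3], int *ind) {
  /* Inverse cell width along each axis. */
  const double ih[3] = {cdim[0] / dim[0], cdim[1] / dim[1], cdim[2] / dim[2]};

#pragma omp parallel for schedule(static)
  for (long p = 0; p < (long)nr_parts; p++) {
    int c[3];
    for (int d = 0; d < 3; d++) {
      c[d] = (int)(parts[p].x[d] * ih[d]);
      if (c[d] >= cdim[d]) c[d] = cdim[d] - 1; /* guard against x == dim */
      if (c[d] < 0) c[d] = 0;                  /* guard against x < 0 */
    }
    ind[p] = example_cell_id(cdim, c[0], c[1], c[2]);
  }
}
```

Anything that needs a global view of the indices (counting particles per cell, sorting by index) would still need its own reduction or a serial pass after this loop.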
What do you think?