Use the threadpool to parallelize operations in the particle-splitting code
Implements #641.
All the loops over the global particle arrays have been parallelized. I have also added alignment information to help the compiler use faster memcpy implementations.
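
For reviewers unfamiliar with the pattern, here is a minimal sketch of the two ideas combined: a loop over a particle array divided into chunks that run on separate threads, with alignment hints on the array base pointers. It is not the code in this change. `struct part`, `split_chunk`, `split_all`, `PART_ALIGN` and `NUM_THREADS` are hypothetical names, plain pthreads stand in for the project's threadpool, and `__builtin_assume_aligned` is the GCC/Clang builtin, which may differ from the macro actually used.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PART_ALIGN 64 /* assumed alignment of the particle arrays */
#define NUM_THREADS 4 /* hypothetical number of worker threads */

/* Hypothetical, simplified particle; the real struct has more fields. */
struct part {
  double x[3];
  float h;
  float mass;
  long long id;
};

/* Work description for one chunk of the loop. */
struct chunk_args {
  const struct part *src; /* base of the parent array (PART_ALIGN-aligned) */
  struct part *dst;       /* base of the child array (PART_ALIGN-aligned) */
  size_t begin, end;      /* half-open index range [begin, end) */
};

/* Mapper: copy each parent in the chunk to its child slot and halve the
 * child's mass. Chunks cover disjoint ranges, so no locking is needed. */
static void *split_chunk(void *arg) {
  const struct chunk_args *a = arg;

  /* Tell the compiler the array bases are aligned (GCC/Clang builtin). */
  const struct part *src = __builtin_assume_aligned(a->src, PART_ALIGN);
  struct part *dst = __builtin_assume_aligned(a->dst, PART_ALIGN);

  for (size_t i = a->begin; i < a->end; i++) {
    memcpy(&dst[i], &src[i], sizeof(struct part));
    dst[i].mass *= 0.5f;
  }
  return NULL;
}

/* Split n particles using up to NUM_THREADS threads.
 * Returns 0 on success, -1 if a thread could not be created. */
static int split_all(const struct part *src, struct part *dst, size_t n) {
  /* The alignment hint is only valid if the caller honours it. */
  assert((uintptr_t)src % PART_ALIGN == 0);
  assert((uintptr_t)dst % PART_ALIGN == 0);

  pthread_t threads[NUM_THREADS];
  struct chunk_args args[NUM_THREADS];
  const size_t chunk = (n + NUM_THREADS - 1) / NUM_THREADS;
  int started = 0, failed = 0;

  for (int t = 0; t < NUM_THREADS; t++) {
    const size_t begin = (size_t)t * chunk;
    if (begin >= n) break; /* nothing left to hand out */
    const size_t end = (begin + chunk < n) ? begin + chunk : n;
    args[t] = (struct chunk_args){src, dst, begin, end};
    if (pthread_create(&threads[t], NULL, split_chunk, &args[t]) != 0) {
      failed = 1;
      break;
    }
    started++;
  }

  /* Wait for every thread that was actually started. */
  for (int t = 0; t < started; t++) pthread_join(threads[t], NULL);
  return failed ? -1 : 0;
}
```

To try it, allocate both arrays with `aligned_alloc(PART_ALIGN, ...)`, call `split_all(src, dst, n)`, and build with `-pthread`. The alignment hint is a promise to the compiler rather than a check, so the sketch asserts it at the entry point.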