Benchmarking of thread scalability with MPI

To be tested:

swift_mpi, compiled with "none" as the external potential and with METIS, run on the EAGLE_25 box.
Run on 4 nodes with 1 to 16 threads per rank.
Run with -g to get the overheads from gparts.

Total time broken down into:
- engine_collect_timestep()
- engine_launch()
- engine_unskip()
- engine_drift_all()
- engine_rebuild()
- engine_repartition()
- engine_print_stats()
- scheduler_reweight()

engine_repartition() broken down into:
- partition_repartition()
- engine_redistribute()
- engine_makeproxies()

engine_rebuild() broken down into:
- space_rebuild()
- engine_maketasks()
- engine_marktasks()

space_rebuild() broken down into:
- space_regrid()
- space_parts_get_cell_index()
- space_gparts_get_cell_index()
- engine_exchange_strays()
- space_parts_sort() <-- Note: also called in engine_redistribute(). Time only the call made from space_rebuild() for now.
- space_gparts_sort() <-- Note: also called in engine_redistribute(). Time only the call made from space_rebuild() for now.
- part_relink_gparts_to_parts() <-- Note: also called in space_split(). Time only the top-level call in space_rebuild() for now.
- part_relink_parts_to_gparts() <-- Note: also called in space_split(). Time only the top-level call in space_rebuild() for now.
- space_split()
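
One possible way to collect the breakdowns listed above is to accumulate wall-clock time per named section and print the totals at the end of the run. The sketch below is an assumption about how this could be done, not the timing code already in the repository: the names section_timer, timer_get, timer_add, timer_report and TIME_SECTION are hypothetical, and the table is neither thread-safe nor MPI-aware (each rank would keep its own copy).

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_TIMERS 32

/* Accumulated wall-clock time for one named section (hypothetical type). */
struct section_timer {
  const char *name; /* section label, e.g. "space_split" */
  double total_s;   /* summed duration in seconds */
  int calls;        /* number of intervals added */
};

static struct section_timer timers[MAX_TIMERS];
static int num_timers = 0;

/* Monotonic wall-clock time in seconds. */
static double wall_time(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Find the accumulator for a section, creating it on first use.
 * Returns NULL if the table is full. */
static struct section_timer *timer_get(const char *name) {
  for (int i = 0; i < num_timers; i++)
    if (strcmp(timers[i].name, name) == 0) return &timers[i];
  if (num_timers == MAX_TIMERS) return NULL;
  timers[num_timers].name = name;
  timers[num_timers].total_s = 0.0;
  timers[num_timers].calls = 0;
  return &timers[num_timers++];
}

/* Add one measured interval [start, end] to a section. */
static void timer_add(const char *name, double start, double end) {
  assert(end >= start);
  struct section_timer *t = timer_get(name);
  if (t != NULL) {
    t->total_s += end - start;
    t->calls++;
  }
}

/* Time a single statement and charge it to the named section. */
#define TIME_SECTION(name, call)          \
  do {                                    \
    const double tic_ = wall_time();      \
    call;                                 \
    timer_add((name), tic_, wall_time()); \
  } while (0)

/* Print every section with its share of the given total run time. */
static void timer_report(FILE *out, double total_s) {
  for (int i = 0; i < num_timers; i++)
    fprintf(out, "%-32s calls=%6d time=%10.4f s (%5.1f%%)\n", timers[i].name,
            timers[i].calls, timers[i].total_s,
            total_s > 0.0 ? 100.0 * timers[i].total_s / total_s : 0.0);
}
```

Each call site of interest would be wrapped as TIME_SECTION("space_split", <the existing call>). Wrapping only the call inside space_rebuild() keeps the calls made from engine_redistribute() or space_split() out of that total, which matches the notes above.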
