Delayed foreign allocation
Fixes #456 (closed).
We no longer blindly allocate memory for foreign particles based on the content of the top-level cells. Instead, once all the tasks (including the local-foreign pairs and MPI comms) have been constructed, we recurse down the cells once. This first pass only descends to the level where we reach tasks, so it is rather quick. We use it to count the number of particles that need to be allocated. We then allocate them and finally link the particle arrays to their cells in the same way as before, just starting from the super level instead of the top level.
This saves quite a bit of memory and is also marginally faster, as we only recurse into the parts of the (foreign) tree that we actually need.
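For reviewers, the two passes described above can be sketched roughly as follows. This is a simplified illustration and not the code in this MR: `struct cell` is reduced to a few hypothetical fields, `has_tasks` stands in for "this is the level at which tasks are attached" (the super level), and the function names are made up for the example.

```c
#include <stddef.h>

/* Hypothetical, stripped-down types used for illustration only. */
struct part {
  double x[3];
};

struct cell {
  struct cell *progeny[8]; /* NULL where there is no child. */
  struct part *parts;      /* Particle array, set by the linking pass. */
  size_t count;            /* Number of particles in this cell and below. */
  int has_tasks;           /* Non-zero if tasks are attached at this level. */
};

/* First pass: count the particles we need, stopping as soon as we reach
 * a level with tasks. Sub-trees without any tasks contribute nothing. */
size_t count_parts_for_tasks(const struct cell *c) {
  if (c == NULL) return 0;
  if (c->has_tasks) return c->count;
  size_t total = 0;
  for (int k = 0; k < 8; k++) total += count_parts_for_tasks(c->progeny[k]);
  return total;
}

/* Link a whole sub-tree into the buffer: the cell points at the start of
 * its slice and its children split that slice between them.
 * Assumes the children's counts sum to the parent's count. */
static size_t link_subtree(struct cell *c, struct part *parts) {
  c->parts = parts;
  size_t offset = 0;
  for (int k = 0; k < 8; k++)
    if (c->progeny[k] != NULL)
      offset += link_subtree(c->progeny[k], parts + offset);
  return c->count;
}

/* Second pass: descend to the level with tasks and link from there.
 * Returns the number of particles consumed from the buffer. */
size_t link_foreign_parts(struct cell *c, struct part *parts) {
  if (c == NULL) return 0;
  if (c->has_tasks) return link_subtree(c, parts);
  size_t offset = 0;
  for (int k = 0; k < 8; k++)
    offset += link_foreign_parts(c->progeny[k], parts + offset);
  return offset;
}
```

In this sketch the caller would sum `count_parts_for_tasks()` over the foreign top-level cells, allocate a single buffer of that size, and then call `link_foreign_parts()` on each cell while advancing the buffer pointer by the returned value. Cells below which no task exists keep a `NULL` particle pointer, which is where the memory saving comes from.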
Future improvements possibly include:
- Do the same for the stars once we have a better star-over-MPI strategy.
- Use the threadpool to parallelize the recursion since it is embarrassingly parallel. (But it's also fast so do we care?)