New MPI FOF strategy
"I've been thinking about how to do the stitching together phase of fof and I have an idea. I'm not quite 100% sure this works so maybe someone can point out if there's some huge flaw I missed:
We initially run serial fof on each node as usual. This gives us a set
of fof fragments which each have a local index within their node and a
global ID as assigned by James (globalID = local index + number of
fragments on lower ranks).
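The global ID assignment described here can be sketched serially (hypothetical function name; in real MPI code each rank would get its offset from an exclusive prefix sum over the per-rank fragment counts, e.g. via MPI_Exscan):

```python
def assign_global_ids(frags_per_rank):
    """Return, for each rank, the global IDs of its FOF fragments.

    globalID = local index + number of fragments on lower ranks.
    Simulated serially; in MPI the offset comes from an exclusive
    prefix sum (MPI_Exscan) of the fragment counts.
    """
    global_ids = []
    offset = 0  # number of fragments on all lower ranks
    for count in frags_per_rank:
        global_ids.append([offset + local for local in range(count)])
        offset += count
    return global_ids

# Example: three ranks holding 2, 3 and 1 fragments respectively.
print(assign_global_ids([2, 3, 1]))  # [[0, 1], [2, 3, 4], [5]]
```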
For each fragment on the local node we identify 'remote' fragments it
should be joined to. This gives us a list of pairs of (local fragment
index, global ID of remote fragment to merge with). So far, I think this
is what James already implemented.
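Because global IDs are assigned in rank order, the rank that owns a remote fragment can be recovered from its global ID alone - which is why sorting the remote IDs (step 3 below) groups them by destination rank. A sketch with a hypothetical helper, assuming each rank knows the fragment-count offsets of all ranks:

```python
import bisect

def owner_of(global_id, offsets):
    """Rank owning a fragment, given the exclusive prefix sums of
    fragment counts (offsets[r] = first global ID on rank r)."""
    return bisect.bisect_right(offsets, global_id) - 1

# Offsets for three ranks holding 2, 3 and 1 fragments: [0, 2, 5].
offsets = [0, 2, 5]
print([owner_of(g, offsets) for g in range(6)])  # [0, 0, 1, 1, 1, 2]
```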
What we really want to know is for each fragment, what is the lowest
global ID we can reach by following the links in the full list of
mergers. I think we can find this with an iterative method where the
lowest global IDs percolate outwards one hop per iteration. The
implementation would look something like this:
1. Allocate a new array minimumID which has one element for each local
fof fragment. Initialize it to be equal to the fragment's globalID.
2. For each fragment, find the lowest global ID of anything we're going
to merge it with using just the local merging information. Use this to
update minimumID (this step is just an optimization - I think the
algorithm works without it, but you do one extra communication
iteration otherwise).
3. Next we want to determine the minimumID of each remote fragment we're
going to merge with and store the minimum of these and our own
minimumID. This means requesting information from the node the remote
fragment is stored on. We can make an array of the remote IDs from our
merger list and sort it. This puts it into order of where the remote
fragments are stored. Then we can do an MPI_ALLTOALLV so that each
remote ID gets sent to the node that contains that fragment. The
receiving node finds the minimumID associated with each received ID. We
send this information back by doing another MPI_ALLTOALLV with the send
and receive buffers swapped.
4. We use the received information to update the minimumID for each of
our local fragments. The minimumID of each fragment is set to the
smallest minimumID of any of the remote fragments it is to be merged
with. We need to count how many minimumID values get changed across the
entire simulation. If the count is non-zero, we do another iteration by
going back to step 3. If we do an iteration where nothing changes then
we're done and minimumID contains the final FOF group identifiers.
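The iteration in steps 1-4 can be sketched serially - here the two ALLTOALLV exchanges are replaced by direct array lookups, merger pairs are given as (globalID, globalID) rather than (local index, remote globalID), and the whole minimumID array is visible at once; in real MPI code each rank would hold only its own slice. The function name is illustrative:

```python
def stitch_fof(num_frags, merger_pairs):
    """Iteratively propagate the lowest reachable global ID.

    num_frags    -- total number of FOF fragments across all ranks
    merger_pairs -- list of (globalID_a, globalID_b) merge links
    Returns minimumID, the final FOF group identifier per fragment.
    """
    # Step 1: minimumID starts equal to each fragment's own global ID.
    minimum_id = list(range(num_frags))

    # Adjacency: which fragments is each fragment to be merged with?
    links = [[] for _ in range(num_frags)]
    for a, b in merger_pairs:
        links[a].append(b)
        links[b].append(a)

    while True:
        # Steps 3-4: each fragment takes the smallest minimumID of any
        # fragment it merges with (this lookup stands in for the two
        # ALLTOALLV calls in the MPI version).
        changed = 0
        new_min = minimum_id[:]
        for frag in range(num_frags):
            for other in links[frag]:
                if minimum_id[other] < new_min[frag]:
                    new_min[frag] = minimum_id[other]
            if new_min[frag] != minimum_id[frag]:
                changed += 1
        minimum_id = new_min
        # Global convergence test: stop once nothing changed anywhere
        # (in MPI this count would come from an allreduce).
        if changed == 0:
            return minimum_id

# A chain 0-1-2-3 plus an isolated fragment 4: the label 0 percolates
# one hop per iteration until all of 0..3 share it.
print(stitch_fof(5, [(0, 1), (1, 2), (2, 3)]))  # [0, 0, 0, 0, 4]
```

The chain example also shows the worst case mentioned below: a group threading through many nodes needs one iteration per hop for the lowest ID to reach the far end.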
I think in realistic cases the number of iterations should be no more
than the cube root of the number of nodes (e.g. if one group mostly
fills the volume). The worst case would be a long thin group that passes
through all nodes but avoids most of the boundaries (e.g. a space
filling curve shaped group!) in which case you would have one iteration
per node. I hope we can ignore that case.
Weak scaling of this method might be ok because adding more nodes and
increasing the volume at fixed resolution doesn't increase the number of
iterations or the amount of data each node has to deal with. Although
I'm not sure how bad alltoall calls are for scaling - maybe depends on
having a good MPI implementation." - John