
SWIFTmpistepsim

This project is a standalone part of SWIFT that aims to roughly simulate the MPI interactions made during a single step of a SWIFT simulation. This makes it easier to see the performance of the MPI calls and makes investigations into tuning more straightforward.

The actual process within SWIFT is that queues of cell-based tasks are run, with their priorities and dependencies determining the order in which the tasks run. Tasks are only added to a queue when they are ready to run, that is when they are no longer waiting on other tasks. This order also determines when the sends and recvs needed to update data on other ranks are initiated, as this happens when the associated task is queued. The sends and recvs are considered complete when MPI_Test returns true, which unlocks any dependencies they have. Obviously a step cannot complete until all the sends and recvs are themselves complete, so the performance of the MPI library and lower layers is critical. This seems to matter most not when we have a lot of work, or very little, but for intermediately busy steps, when the local work completes much sooner than the MPI exchanges.
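
As an illustration of this pattern, here is a minimal sketch (not SWIFT source) of what a single exchange boils down to: a non-blocking MPI_Isend or MPI_Irecv issued when the task is queued, followed by repeated MPI_Test calls until the flag comes back true, at which point any dependent tasks could be unlocked. Run with two ranks.

    /* Minimal illustrative sketch, not SWIFT code. */
    #include <mpi.h>

    int main(int argc, char *argv[]) {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      char buffer[1024];
      MPI_Request req = MPI_REQUEST_NULL;

      /* Initiation: in SWIFT this happens when the send or recv task is
       * queued. */
      if (rank == 0)
        MPI_Isend(buffer, (int)sizeof(buffer), MPI_BYTE, 1, 0,
                  MPI_COMM_WORLD, &req);
      else if (rank == 1)
        MPI_Irecv(buffer, (int)sizeof(buffer), MPI_BYTE, 0, 0,
                  MPI_COMM_WORLD, &req);

      /* Completion: polled with MPI_Test; once the flag is true any
       * dependent tasks could run. */
      if (rank < 2) {
        int flag = 0;
        while (!flag) MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
      }

      MPI_Finalize();
      return 0;
    }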

In SWIFT the enqueuing of tasks, and thus the initiation of the sends and recvs (using MPI_Isend and MPI_Irecv), can happen from all the available threads. The polling of MPI_Test is done primarily using two queues, although these can steal work from other queues and other queues can steal MPI_Test calls as well. Enqueuing and processing can happen at the same time.

To keep this simple, this package uses three threads to simulate all this: one thread that initiates the sends and recvs, and two threads that poll for their completion. All three threads run at the same time.
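
An illustrative outline of this thread layout (not the actual swiftmpistepsim source; the injector and poller bodies are left as stubs) might look like the following.

    /* Illustrative outline only: one injector thread and two poller
     * threads, all running concurrently. */
    #include <pthread.h>

    /* Initiate each logged send/recv with MPI_Isend/MPI_Irecv (stub). */
    static void *injector(void *arg) { return NULL; }

    /* Repeatedly call MPI_Test on the pending requests until all of them
     * have completed (stub). */
    static void *poller(void *arg) { return NULL; }

    void run_step(void) {
      pthread_t threads[3];
      pthread_create(&threads[0], NULL, injector, NULL);
      pthread_create(&threads[1], NULL, poller, NULL);
      pthread_create(&threads[2], NULL, poller, NULL);
      for (int i = 0; i < 3; i++) pthread_join(threads[i], NULL);
    }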

The sends and recvs themselves are captured from a run of SWIFT configured using the configure option --enable-mpiuse-reports. When this is enabled, each step of the simulation produces a log per rank recording when each MPI interaction was started and when it completed. Other information, such as the ranks involved, the size of the data exchanged, the MPI tags used and which SWIFT task types were involved, is also recorded.

We read a concatenated log of all these outputs for a single step and use the relative times at which the interactions were started as a guide for initiating them; the completions are simply polled, in the logged completion order, until they really occur. It is also possible to just start all the interactions as quickly as possible, for comparison.
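
Fleshing out the injector thread from the outline above, the time-guided initiation could look something like this; struct logentry and its fields are hypothetical stand-ins for whatever is read from the step log, and entries are assumed to be sorted by start time.

    /* Hypothetical sketch of time-guided injection: each send or recv is
     * initiated at roughly the same relative time as in the original
     * SWIFT step. */
    #include <mpi.h>

    struct logentry {
      double tic;      /* logged start time, seconds from start of step */
      int isrecv;      /* 1 for MPI_Irecv, 0 for MPI_Isend */
      int otherrank;   /* the other rank in the exchange */
      int tag;         /* the MPI tag used by SWIFT */
      void *data;      /* buffer of the logged size */
      int size;        /* size of the data in bytes */
      MPI_Request req;
    };

    void inject(struct logentry *entries, int nentries) {
      double start = MPI_Wtime();
      for (int i = 0; i < nentries; i++) {
        /* Wait until the logged relative start time is reached. */
        while (MPI_Wtime() - start < entries[i].tic) { /* spin */ }

        if (entries[i].isrecv)
          MPI_Irecv(entries[i].data, entries[i].size, MPI_BYTE,
                    entries[i].otherrank, entries[i].tag, MPI_COMM_WORLD,
                    &entries[i].req);
        else
          MPI_Isend(entries[i].data, entries[i].size, MPI_BYTE,
                    entries[i].otherrank, entries[i].tag, MPI_COMM_WORLD,
                    &entries[i].req);
      }
    }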

To use the program swiftmpistepsim you need to select the step of interest (for instance one whose run-time seems dominated by the MPI tasks) and then concatenate all the logs for that step into a single file. You can then run using:

   mpirun -np <nranks> swiftmpistepsim <step-log> <output-log>

which will output timings for the various MPI calls and record a log of the reproduction in the file <output-log>. Note that you must use the same number of ranks as the original run of SWIFT.

The verbose output and the output log can be inspected to see which delays are driving the elapsed time for the step. The main culprits seem to be outlier MPI_Test calls that take tens of milliseconds.

The script post-process.py can be run on the output log to pair the sends and recvs across the ranks. This allows inspection of how well things like eager exchanges are working and what effect the size of the packets has.


Peter W. Draper 24 Sep 2019.