Replaced process_plot_tasks and process_plot_tasks_MPI bash scripts with Python versions.
While trying to speed up the code for some large projects, I often had to produce task plots for large runs that used 16 ranks and were run for 100s of steps. The bash scripts in the repository were not very efficient for processing these, since they only partially run in parallel:
1. All task plots are produced by calling `plot_tasks.py` in parallel. For MPI runs, this produces `nrank` plots.
2. All files are analysed to produce task statistics by calling `analyse_tasks.py` in parallel. For MPI runs, this parallel step is done `nrank` times in a non-parallelised loop over ranks.
3. Web pages are produced. This is done in a serial loop over steps (and ranks).
If the task files are dominated by a small number of very large files, then the runtime of the individual steps is completely dominated by the processing time of those files, and the parallel efficiency of the old bash scripts becomes very low.
I have addressed this by allowing `plot_tasks.py` and `analyse_tasks.py` to be called concurrently, using a new Python version of the bash scripts. Instead of running the steps in parallel one after the other, a list of commands to run is created and then executed in parallel using sub-processes spawned with the standard Python `subprocess` module. This eliminates load imbalances between steps 1 and 2.
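To make the idea concrete, here is a minimal sketch of that pattern: build the full command list up front and drain it with a fixed number of concurrent sub-processes. The file names and the command-line arguments shown for `plot_tasks.py` and `analyse_tasks.py` are placeholders, not the real interfaces of those scripts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_command(cmd):
    """Run one command as a sub-process and return its exit code."""
    return subprocess.run(cmd, check=False).returncode

def run_in_parallel(commands, nproc):
    """Execute all commands, keeping at most nproc sub-processes alive at a time."""
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        return list(pool.map(run_command, commands))

# Build the full command list first (placeholder file names and arguments),
# then execute everything in a single parallel pass.
task_files = ["thread_info-step0001.dat", "thread_info-step0002.dat"]
commands = []
for task_file in task_files:
    commands.append(["python3", "plot_tasks.py", task_file, task_file + ".png"])
    commands.append(["python3", "analyse_tasks.py", task_file])

run_in_parallel(commands, nproc=32)
```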
To further improve the parallel efficiency, the new scripts also have an option to sort the command list according to a weight factor and then execute the commands in reverse order of the weights. Currently, the weights are determined by the size of the input file, with an additional factor of 2 for `plot_tasks.py` commands, which are generally more expensive. This significantly improves the parallel efficiency for cases where the total runtime is dominated by a single file, or where large files sit at the end of the alphabetically sorted list of input files.
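Continuing from the `commands` list in the previous sketch, the weighting could look roughly like this; the command layout and the factor of 2 are the assumptions described above, and the real scripts may compute the weights differently.

```python
import os

PLOT_WEIGHT_FACTOR = 2.0  # assumed: plot commands cost roughly twice as much

def command_weight(cmd):
    """Estimate the relative cost of a command from the size of its input file."""
    script, input_file = cmd[1], cmd[2]  # assumed layout: ["python3", script, task_file, ...]
    weight = os.path.getsize(input_file)
    if "plot_tasks.py" in script:
        weight *= PLOT_WEIGHT_FACTOR
    return weight

# Heaviest commands first ("reverse order of the weights"), so a single huge
# file starts early instead of running alone at the tail of the schedule.
commands.sort(key=command_weight, reverse=True)
```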
I have tested and timed the new scripts using two test cases. All tests were performed using 32 parallel processes on a machine with plenty of available memory.
- The non-MPI case was tested on a 100-step task output from a COLIBRE zoom simulation. The original script took 147.6s, the new script 126.9s, and the new script with additional sorting 120.2s.
- A second non-MPI case used the same task output, but for 1000 steps. The original script took 2292.7s, the new script 1753.1s, and the new script with additional sorting 1291.9s.
- The MPI case was tested on a 20-step task output from a COLIBRE profiling simulation. The original script never exploited all 32 processes (since there are only 20 steps) and took 412.7s; the new script took 324.1s and the new script with additional sorting 313.6s.
In summary: the new scripts are definitely an improvement for MPI runs with a lot of ranks, no matter how many steps are used. For non-MPI runs they make a clear difference for longer runs and are not slower for short runs.
The whole process could probably be made even more efficient by making changes to `plot_tasks.py` so that this script only outputs a task plot for a single rank (since creating the image seems to be the dominant factor for performance). But since I'm already happy with the current performance, I'll leave that for a future effort.
One issue I might also want to address in the future (which affected both the old and new scripts) is that large task files can eat a lot of memory: my poor desktop has occasionally had to swap when processing four of them simultaneously. The new scripts could be made a bit more clever so that they do not launch commands that are expected to require more memory than is available, although that might be tricky to implement correctly.
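For reference, a rough sketch of what such a memory gate could look like, assuming a crude estimate of each command's peak memory (for example a fixed multiple of the input file size) and reading `MemAvailable` from `/proc/meminfo` on Linux; none of this is in the current scripts.

```python
import subprocess
import time

def available_memory():
    """Return the available system memory in bytes (Linux-specific)."""
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    return 0

def run_with_memory_gate(commands, memory_estimates, nproc):
    """Launch commands in order, holding back any command whose estimated
    memory footprint does not fit in the currently available memory.

    A real implementation would need a fallback for commands whose estimate
    never fits (e.g. run them on their own), otherwise this loop could stall.
    """
    pending = list(zip(commands, memory_estimates))
    running = []
    while pending or running:
        # Drop finished processes, then launch as many pending commands as
        # the process limit and the available memory allow.
        running = [proc for proc in running if proc.poll() is None]
        while (pending and len(running) < nproc
               and pending[0][1] < available_memory()):
            cmd, _ = pending.pop(0)
            running.append(subprocess.Popen(cmd))
        time.sleep(1.0)
```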