Performance of full model

@pdraper @rgb @nnrw56

Here is an early analysis of a full-model EAGLE-25 run on 8 nodes (16 ranks) of cosma-7. It uses parmetis, tbbmalloc, and the maximal level of code optimization and vectorization.

Total measured time: 84118.528 s
Total time: 84430.300000  s

Time spent in the different code sections:
 - 'Engine Launch                           ' (203508 calls, time: 37320.2026s): 44.2024%
 - 'Engine Collect End Of Step              ' (203506 calls, time: 35810.9253s): 42.4148%
 - 'Space Rebuild                           ' ( 8728 calls, time: 2364.8237s): 2.8009%
 - 'Engine Exchange Cells                   ' ( 8728 calls, time: 1570.1745s): 1.8597%
 - 'Writing Particle Properties             ' (  100 calls, time: 981.0003s): 1.1619%
 - 'Creating Recv Tasks                     ' ( 8728 calls, time: 912.8609s): 1.0812%
 - 'Communicating Rebuild Flag              ' (203506 calls, time: 823.3623s): 0.9752%
 - 'Engine Drift All                        ' ( 8947 calls, time: 716.6176s): 0.8488%
Elements in 'Other' category (<0.8%):
 - 'Exchanging Cell Tags                    ' ( 8728 calls, time: 535.5152s): 0.6343%
 - 'Gpart Assignment                        ' ( 8728 calls, time: 506.9014s): 0.6004%
 - 'Engine Unskip                           ' (194786 calls, time: 455.3908s): 0.5394%
 - 'Engine Print Task Counts                ' (212236 calls, time: 303.6919s): 0.3597%
 - 'Recursively Linking Foreign Arrays      ' ( 8728 calls, time: 247.5270s): 0.2932%
....

The time spent in 'Engine Collect End Of Step' is essentially load-imbalance time, i.e. ranks waiting for the slowest rank to finish the step.
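To put a number on the imbalance, here is a minimal sketch that recomputes the shares from the timers above. The `wait_fraction` metric (collect time divided by launch plus collect time) is my own illustrative choice, not something the timer tool reports, and it assumes that all of the collect time is waiting.

```python
# Timer values copied from the report above (seconds).
TOTAL_TIME = 84430.3
timers = {
    "Engine Launch": 37320.2026,
    "Engine Collect End Of Step": 35810.9253,
}


def share(seconds, total=TOTAL_TIME):
    """Return the percentage of the total run time taken by `seconds`."""
    return 100.0 * seconds / total


launch = timers["Engine Launch"]
collect = timers["Engine Collect End Of Step"]

# Hypothetical metric: fraction of the launch + collect time that is
# spent in the collect step (assumed here to be pure waiting).
wait_fraction = collect / (launch + collect)

print(f"launch:  {share(launch):.2f}% of total")
print(f"collect: {share(collect):.2f}% of total")
print(f"collect / (launch + collect): {wait_fraction:.3f}")
```

Under that assumption, close to half of the time spent in these two sections goes to waiting rather than running tasks.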

At 42% of the total run time, this is not great. We may need to re-assess the weights we assign to the tasks.

The run re-partitioned 41 times over the course of these 200k steps.
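For scale, a rough sketch of how often the run rebuilt and re-partitioned. It assumes the number of 'Engine Collect End Of Step' calls equals the number of steps and that 'Space Rebuild' calls count rebuilds; both are my reading of the timer output, not something stated by the tool.

```python
# Counts copied from the report above.
steps = 203506        # 'Engine Collect End Of Step' calls (assumed = number of steps)
rebuilds = 8728       # 'Space Rebuild' calls (assumed = number of rebuilds)
repartitions = 41     # stated in the report

steps_per_repartition = steps / repartitions
steps_per_rebuild = steps / rebuilds
rebuilds_per_repartition = rebuilds / repartitions

print(f"steps per repartition:    {steps_per_repartition:.0f}")
print(f"steps per rebuild:        {steps_per_rebuild:.1f}")
print(f"rebuilds per repartition: {rebuilds_per_repartition:.0f}")
```

That works out to roughly one repartition every 5000 steps, against a rebuild every 23 steps or so.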

Edited by Matthieu Schaller