Logger performance
FYI: @lhausammann and @nnrw56
I have been running the PMillennium-1536 example with and without the logger switched on. This example has 3.6B particles, and I am running it on 4 MPI ranks with 28 threads each. Both runs use the latest master and the Intel 2020 compiler, with the same configuration apart from `--enable-logger` at configure time and the `--logger` runtime flag.
Overall verdict: the logger run is much slower (about 2.7x over the first 8 steps) and uses noticeably more memory.
For the first 8 steps, the run without the logger took 8000 s; the run with the logger took 21600 s.
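As a quick sanity check of these headline numbers, here is a small Python sketch. Its only inputs are the two wall-clock times and the 8-step window quoted above; it computes the slowdown factor, the extra wall-clock time, and the mean time per step for each run.

```python
# Wall-clock times for the first 8 steps, as quoted in this report (seconds).
NO_LOGGER_S = 8000
LOGGER_S = 21600
N_STEPS = 8


def slowdown(base_s: float, test_s: float) -> float:
    """Ratio of the test run's wall-clock time to the baseline's."""
    return test_s / base_s


ratio = slowdown(NO_LOGGER_S, LOGGER_S)
extra_s = LOGGER_S - NO_LOGGER_S

print(f"slowdown: {ratio:.2f}x")                        # 2.70x
print(f"extra time: {extra_s} s ({extra_s / 3600:.1f} h)")  # 13600 s (3.8 h)
print(f"mean per step: {NO_LOGGER_S / N_STEPS:.0f} s without logger, "
      f"{LOGGER_S / N_STEPS:.0f} s with logger")        # 1000 s vs 2700 s
```

The per-step mean for the logger run is strongly skewed by the very slow first step, so it should not be read as a steady-state cost.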
Key differences:
- All tasks are slower with the logger, likely because the particles are heavier in memory and we are memory-bandwidth limited.
- The run develops an imbalance, likely because of the logger task, and this triggers many redistributions.
- The very first step takes 45 min to complete, 90% of which is spent in the logger task, likely initialising things.
- Index files are dumped almost every step, taking ~2 min per step.
- About 15% of the time in the logger run is unaccounted for. I am not sure which function call fails to report the time spent in it, but that needs fixing.
- Resident (RES) memory on the compute nodes is about 15% higher in the run with the logger.
- The logger run crashed twice, running out of memory while trying to rebalance. The default run never builds up more than 1% MPI imbalance after the 2nd step and hence never redistributes.
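To put the individual items above in proportion, the following Python sketch turns them into seconds and shares of the 21600 s logger run. Two assumptions are not stated in the report and are labelled in the code: that an index dump happens on every one of the 8 steps (so its figure is an upper bound), and that the 15% unaccounted share refers to the whole 8-step run. The categories may overlap, so the shares are listed side by side rather than summed.

```python
# Rough budget for the logger run (first 8 steps), built only from the numbers
# quoted in the list above.
# ASSUMPTIONS (not measured directly):
#  - an index dump happens on every one of the 8 steps (upper bound),
#  - the ~15% unaccounted share refers to the whole 8-step run.
# The categories may overlap, so the shares are not meant to be summed.
LOGGER_RUN_S = 21600
N_STEPS = 8

first_step_s = 45 * 60                          # first step: 45 min
first_step_logger_s = first_step_s * 90 // 100  # 90% of it in the logger task
index_dump_s = 2 * 60 * N_STEPS                 # ~2 min per step, every step assumed
unaccounted_s = LOGGER_RUN_S * 15 // 100        # ~15% with no timing reported

budget = {
    "first-step logger task": first_step_logger_s,
    "index file dumps (upper bound)": index_dump_s,
    "unaccounted": unaccounted_s,
}

for name, seconds in budget.items():
    share = seconds / LOGGER_RUN_S
    print(f"{name:32s} {seconds:6d} s  {share:6.1%} of run")
```

This gives roughly 2430 s for the first-step logger task, up to 960 s for index dumps and about 3240 s unaccounted, which together cover only part of the 13600 s gap to the run without the logger; the rest would be the general per-task slowdown and the redistributions.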
Let me know if there is any other useful information I can extract from these runs.
The log files (so far) are:
CPU balance log for the run with the logger: rank_cpu_balance.log