Large MPI rank run hangs randomly during the engine_launch (tasks) phase
Setup
I am running commit e13260c825d6ad2e28e579ce80d4cb58463f0b4e
compiled with the following options
--with-tbbmalloc --enable-memuse-reports --with-gravity=basic --with-metis=/apps/metis/5.1.0 --with-parmetis=/apps/parmetis/4.0.3 CC=icc FC=ifort --no-create --no-recursion
and modules are
2) intel-compiler/2019.3.199 3) openmpi/4.0.2(default) 4) fftw3/3.3.8 5) parmetis/4.0.3 6) metis/5.1.0 7) szip/2.1.1 8) hdf5/1.10.5p 9) gsl/2.6
This is on the gadi machine (https://nci.org.au/our-systems/hpc-systems), which uses Intel Xeon Cascade Lake nodes (I'll follow up with details of the interconnect fabric).
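For reference, the build environment can be reproduced roughly as below. This is a sketch rather than the exact recipe: the module-load line is assembled from the module list above, and the configure invocation is pieced together from the reported flags (leaving out the config.status-only --no-create --no-recursion).

module load intel-compiler/2019.3.199 openmpi/4.0.2 fftw3/3.3.8 \
            parmetis/4.0.3 metis/5.1.0 szip/2.1.1 hdf5/1.10.5p gsl/2.6
./configure CC=icc FC=ifort \
            --with-tbbmalloc --enable-memuse-reports --with-gravity=basic \
            --with-metis=/apps/metis/5.1.0 --with-parmetis=/apps/parmetis/4.0.3
make -j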
Issue
The code seems to hang randomly during an engine_launch (tasks) phase when running on many nodes (>100 or so), with two MPI ranks per node and 24 threads per MPI rank. I am running a 4320^3 dark-matter-only simulation. Smaller runs of order 2000^3 using 16 nodes complete without issue. I have tried running several times and it has hung on steps 13, 35, 53, 55 and 14, so it seems quite random. The specific command used under PBS is
export OMP_NUM_THREADS=24
mpirun -np $(( $PBS_NCPUS / $OMP_NUM_THREADS )) --map-by node:PE=$OMP_NUM_THREADS --rank-by core --report-bindings --verbose=2 --pin --cosmology --self-gravity --threads=$OMP_NUM_THREADS --velociraptor --density_grids
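Put together, the job script looks roughly like the sketch below, not the exact script: the binary name ./swift_mpi, the parameter file params.yml and the PBS resource lines are illustrative placeholders, and the 48-cores-per-node figure assumes gadi's Cascade Lake nodes.

#!/bin/bash
#PBS -q normal
#PBS -l ncpus=9600                      # e.g. 200 nodes x 48 cores
#PBS -l walltime=06:00:00

export OMP_NUM_THREADS=24

# two MPI ranks per 48-core node, 24 OpenMP threads per rank
mpirun -np $(( $PBS_NCPUS / $OMP_NUM_THREADS )) \
       --map-by node:PE=$OMP_NUM_THREADS --rank-by core --report-bindings \
       ./swift_mpi --pin --cosmology --self-gravity --velociraptor --density_grids \
       --threads=$OMP_NUM_THREADS --verbose=2 params.yml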
Running with --verbose=2 provides a lot of information to dig through, but in general there are some (1-20) MPI ranks that do not seem to complete the engine_launch (tasks) phase.
@matthieu suggested I try using
export FI_OFI_RXM_TX_SIZE=2048
export FI_OFI_RXM_RX_SIZE=2048
based on #520 (closed)
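Concretely, I set these in the job script before launching; the sketch below also forwards them explicitly with Open MPI's -x option, in case the launcher does not propagate the environment to remote ranks on its own (whether that is actually needed on gadi is an assumption on my part; the binary and parameter file names are placeholders as before).

export FI_OFI_RXM_TX_SIZE=2048
export FI_OFI_RXM_RX_SIZE=2048
# -x exports the named environment variables to every rank (Open MPI option)
mpirun -x FI_OFI_RXM_TX_SIZE -x FI_OFI_RXM_RX_SIZE \
       -np $(( $PBS_NCPUS / $OMP_NUM_THREADS )) --map-by node:PE=$OMP_NUM_THREADS \
       --rank-by core --report-bindings \
       ./swift_mpi --pin --cosmology --self-gravity --velociraptor --density_grids \
       --threads=$OMP_NUM_THREADS --verbose=2 params.yml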
Interestingly, with these limits set, I now get an error instead:
[0045] [01459.4] engine.c:engine_exchange_strays():675: Do not have a proxy for the requested nodeID 39 for part with id=38570592959, x=[-1.480353e+33,5.699500e+02,6.286355e+00].
This message occurs after all ranks have reported "Moving non-local particles", naturally. Note that previous runs that stalled did not necessarily hang during this step; in fact, most seem to hang just before reporting "engine_launch: (tasks) took #####.### ms".
I'm quite confused and have contacted NCI (the people who run gadi), but am still awaiting a response. Perhaps this issue has already been solved elsewhere and it is just a matter of specifying some extra MPI runtime arguments.
For completeness I have attached the output from the run that aborted and from a run that just hung: output.abortrun.log.gz output.hangrun.short.log.gz. Note that I ran with --verbose=2, so they are quite large. I will hopefully have updates from NCI to add to this issue.