Crash with "Interacting unsorted cell" in EAGLE 50Mpc box
I've had an EAGLE 50Mpc box run crash with the following on stderr:
[0003] [08452.7] runner_doiact_functions_stars.h:runner_dosub_pair_stars_density():1261: Interacting unsorted cell (parts). 3
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
In: PMI_Abort(-1, application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3)
slurmstepd-irene1012: error: *** STEP 3898257.0 ON irene1012 CANCELLED AT 2020-03-26T13:29:50 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
The last few lines of stdout look like this:
[0000] [08452.4] engine_print_task_counts: nr_sparts = 3082259.
[0000] [08452.4] engine_print_task_counts: nr_bparts = 3031.
[0000] [08452.4] engine_print_task_counts: took 1.531 ms.
[0000] [08452.6] engine_launch: (tasks) took 203.754 ms.
[0000] [08452.6] engine_unskip_timestep_communications: took 1.793 ms.
It was running with MPI on 8 nodes and 384 cores, using SWIFT revision 77dc3b54 from master with the fixes for multiple black hole mergers (39740c15) and repartitioning after a restart (41fadcb1) cherry-picked in.
SWIFT was configured with:
./configure CC=icc LIBS="${SWIFT_LIBS}" LDFLAGS="${SWIFT_LDFLAGS}" \
--with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7 \
--enable-ipo \
--with-hydro=sphenix \
--with-kernel=wendland-C2 \
--with-subgrid=EAGLE \
--with-hdf5=/ccc/cont005/home/durham/hellyjoh/swift/bin/h5pcc \
--with-fftw=/ccc/cont005/home/durham/hellyjoh/swift/ \
--with-gsl=${GSL_ROOT} \
--with-parmetis=/ccc/cont005/home/durham/hellyjoh/swift/
The full log files and parameter file are here: run6_logs.tar.gz