
Crash with "Interacting unsorted cell" in EAGLE 50Mpc box

I've had an EAGLE 50Mpc box run crash with the following on stderr:

[0003] [08452.7] runner_doiact_functions_stars.h:runner_dosub_pair_stars_density():1261: Interacting unsorted cell (parts). 3
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
In: PMI_Abort(-1, application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3)
slurmstepd-irene1012: error: *** STEP 3898257.0 ON irene1012 CANCELLED AT 2020-03-26T13:29:50 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.

The last few lines of stdout look like this:

[0000] [08452.4] engine_print_task_counts: nr_sparts = 3082259.
[0000] [08452.4] engine_print_task_counts: nr_bparts = 3031.
[0000] [08452.4] engine_print_task_counts: took 1.531 ms.
[0000] [08452.6] engine_launch: (tasks) took 203.754 ms.
[0000] [08452.6] engine_unskip_timestep_communications: took 1.793 ms.
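To line up the stderr and stdout excerpts by rank and timestamp, a minimal parsing sketch may help. It assumes every log line has the `[rank] [time] message` prefix seen in the lines quoted above; the helper names `parse_log_line` and `last_message_per_rank` are hypothetical and not part of SWIFT.

```python
import re

# Assumed layout, inferred from the quoted log lines: "[rank] [time] message".
LINE_RE = re.compile(r"^\[(\d+)\] \[(\d+\.\d+)\] (.*)$")


def parse_log_line(line):
    """Return (rank, time, message), or None if the line lacks the prefix."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), match.group(3)


def last_message_per_rank(lines):
    """Map each rank to the (time, message) of its last parseable line."""
    latest = {}
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is not None:
            rank, time, message = parsed
            latest[rank] = (time, message)
    return latest
```

Feeding the combined stdout and stderr lines to `last_message_per_rank` gives the final message each rank printed before the abort; lines without the prefix, such as the `srun` and `slurmstepd` messages, are skipped.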

The run used MPI on 8 nodes and 384 cores, with SWIFT revision 77dc3b54 from master plus two cherry-picked commits: the fix for multiple black hole mergers (39740c15) and repartitioning after a restart (41fadcb1).

SWIFT was configured with:

./configure CC=icc LIBS="${SWIFT_LIBS}" LDFLAGS="${SWIFT_LDFLAGS}" \
    --with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7 \
    --enable-ipo \
    --with-hydro=sphenix \
    --with-kernel=wendland-C2 \
    --with-subgrid=EAGLE \
    --with-hdf5=/ccc/cont005/home/durham/hellyjoh/swift/bin/h5pcc \
    --with-fftw=/ccc/cont005/home/durham/hellyjoh/swift/ \
    --with-gsl=${GSL_ROOT} \
    --with-parmetis=/ccc/cont005/home/durham/hellyjoh/swift/

The full log files and parameter file are here: run6_logs.tar.gz
