Segmentation fault in space_gparts_sort() in EAGLE 50Mpc box
I've had an EAGLE 50Mpc box run crash with a segmentation fault. Stderr is not very helpful:
srun: error: irene1046: task 0: Segmentation fault
srun: First task exited 600s ago
srun: step:3912414.0 tasks 1-15: running
srun: step:3912414.0 task 0: exited abnormally
srun: Terminating job step 3912414.0
slurmstepd-irene1046: error: *** STEP 3912414.0 ON irene1046 CANCELLED AT 2020-03-31T19:27:53 ***
The last bit of stdout from swift is
[0000] [04322.1] space_allocate_extras: Requesting space for future 0/0/228400/0 part/gpart/sparts/bparts.
[0000] [04322.2] space_parts_get_cell_index: took 92.455 ms.
[0000] [04322.3] space_gparts_get_cell_index: took 89.441 ms.
[0000] [04322.3] space_sparts_get_cell_index: took 12.706 ms.
[0000] [04322.3] space_bparts_get_cell_index: took 0.380 ms.
[0000] [04322.4] space_rebuild: Moving non-local particles took 100.296 ms.
[0000] [04322.7] engine_exchange_strays: sent out 209/292/2/0 parts/gparts/sparts/bparts, got 97/187/0/0 back.
[0000] [04322.7] engine_exchange_strays: took 218.005 ms.
########## ########## ########## ########## ########## ########## ########## ##########
Execution Sum Up
########## ########## ########## ########## ########## ########## ########## ##########
...
As in #662 (closed), this is with MPI on 8 nodes and 384 cores, using swift revision 77dc3b54 from master with the fixes for multiple black hole mergers (39740c15) and repartitioning after a restart (41fadcb1) cherry-picked in.
Configuration was
./configure CC=icc LIBS="${SWIFT_LIBS}" LDFLAGS="${SWIFT_LDFLAGS}" \
--with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7 \
--enable-ipo \
--with-hydro=sphenix \
--with-kernel=wendland-C2 \
--with-subgrid=EAGLE \
--with-hdf5=/ccc/cont005/home/durham/hellyjoh/swift/bin/h5pcc \
--with-fftw=/ccc/cont005/home/durham/hellyjoh/swift/ \
--with-gsl=${GSL_ROOT} \
--with-parmetis=/ccc/cont005/home/durham/hellyjoh/swift/
and I ran it with
ccc_mprun ${codedir}/swiftsim-bh-fix/examples/swift_mpi --verbose=1 --restart \
--param=Restarts:resubmit_on_exit:1 \
--param=Restarts:resubmit_command:${codedir}/irene/resub_fix.sh \
--param=Restarts:max_run_time:23.0 \
--param=Snapshots:output_list:${codedir}/output_times.txt \
--param=InitialConditions:file_name:/ccc/store/cont005/ra4707/hellyjoh/EAGLE_ICs/SwiftICs/EAGLE_L0050N0752_ICs.hdf5 \
--param=EAGLECooling:dir_name:${codedir}/Data/coolingtables/ \
--param=EAGLEFeedback:filename:${codedir}/Data/yieldtables/ \
--pin --cosmology ${eagle_flags} \
--threads=24 eagle_50.yml
The run crashed at about z=0.27, and the crash reproduces if I restart from either the current or the previous set of restart files. I've put the full set of logs on cosma in /cosma7/data/dp004/jch/EAGLE-XL/ParameterSearch-Swift/crash_logs/. See swift.3912414.12.out for the run where it crashed.
I tried restarting it in ddt and got a memory error at line 3091 in space.c, which is in space_gparts_sort():
while (ind[j] == target_cid) {
Note that this was an optimized build, so I'm not sure how reliable that line number is.
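Since the reported line reads from `ind`, one way to narrow this down might be to validate the index array just before the sort. The following is only a minimal sketch, not SWIFT code: it assumes `ind` holds one cell index per gpart and that valid values lie in the range [0, number of cells). The helper name `find_bad_index` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of SWIFT.
 * Scans ind[0..count-1] and returns the position of the first entry that
 * lies outside [0, num_bins), or -1 if every entry is in range.
 * Assumption: a valid cell index is a non-negative int below num_bins. */
long find_bad_index(const int *ind, size_t count, int num_bins) {
  assert(ind != NULL || count == 0);
  for (size_t k = 0; k < count; k++) {
    if (ind[k] < 0 || ind[k] >= num_bins) return (long)k;
  }
  return -1;
}
```

If a check like this returned a position other than -1 right before the sort, that would point to a bad cell index coming out of the preceding steps rather than to the sort loop itself. If it returned -1, the out-of-bounds access would more likely come from the offset `j` running past the end of the array.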