Segmentation fault in space_gparts_sort() in EAGLE 50Mpc box

I've had an EAGLE 50Mpc box run crash with a segmentation fault. Stderr is not very helpful:

srun: error: irene1046: task 0: Segmentation fault
srun: First task exited 600s ago
srun: step:3912414.0 tasks 1-15: running
srun: step:3912414.0 task 0: exited abnormally
srun: Terminating job step 3912414.0
slurmstepd-irene1046: error: *** STEP 3912414.0 ON irene1046 CANCELLED AT 2020-03-31T19:27:53 ***

The last bit of stdout from swift is

[0000] [04322.1] space_allocate_extras: Requesting space for future 0/0/228400/0 part/gpart/sparts/bparts.
[0000] [04322.2] space_parts_get_cell_index: took 92.455 ms.
[0000] [04322.3] space_gparts_get_cell_index: took 89.441 ms.
[0000] [04322.3] space_sparts_get_cell_index: took 12.706 ms.
[0000] [04322.3] space_bparts_get_cell_index: took 0.380 ms.
[0000] [04322.4] space_rebuild: Moving non-local particles took 100.296 ms.
[0000] [04322.7] engine_exchange_strays: sent out 209/292/2/0 parts/gparts/sparts/bparts, got 97/187/0/0 back.
[0000] [04322.7] engine_exchange_strays: took 218.005 ms.
########## ########## ########## ########## ########## ########## ########## ##########
Execution Sum Up
########## ########## ########## ########## ########## ########## ########## ##########
...

As in #662 (closed), this is with MPI on 8 nodes and 384 cores, using swift revision 77dc3b54 from master with the fixes for multiple black hole mergers (39740c15) and repartitioning after a restart (41fadcb1) cherry-picked in.

Configuration was

./configure CC=icc LIBS="${SWIFT_LIBS}" LDFLAGS="${SWIFT_LDFLAGS}" \
    --with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7 \
    --enable-ipo \
    --with-hydro=sphenix \
    --with-kernel=wendland-C2 \
    --with-subgrid=EAGLE \
    --with-hdf5=/ccc/cont005/home/durham/hellyjoh/swift/bin/h5pcc \
    --with-fftw=/ccc/cont005/home/durham/hellyjoh/swift/ \
    --with-gsl=${GSL_ROOT} \
    --with-parmetis=/ccc/cont005/home/durham/hellyjoh/swift/

and I ran it with

ccc_mprun ${codedir}/swiftsim-bh-fix/examples/swift_mpi --verbose=1 --restart \
    --param=Restarts:resubmit_on_exit:1 \
    --param=Restarts:resubmit_command:${codedir}/irene/resub_fix.sh \
    --param=Restarts:max_run_time:23.0 \
    --param=Snapshots:output_list:${codedir}/output_times.txt \
    --param=InitialConditions:file_name:/ccc/store/cont005/ra4707/hellyjoh/EAGLE_ICs/SwiftICs/EAGLE_L0050N0752_ICs.hdf5 \
    --param=EAGLECooling:dir_name:${codedir}/Data/coolingtables/ \
    --param=EAGLEFeedback:filename:${codedir}/Data/yieldtables/ \
    --pin --cosmology ${eagle_flags} \
    --threads=24 eagle_50.yml

It crashed at about z=0.27. It happens again if I restart from the current or previous set of restart files. I've put the full set of logs on cosma in /cosma7/data/dp004/jch/EAGLE-XL/ParameterSearch-Swift/crash_logs/. See swift.3912414.12.out for the run where it crashed.

I tried restarting it in ddt and got a memory error at line 3091 in space.c, which is in space_gparts_sort():

while (ind[j] == target_cid) {

although it was an optimized build, so I'm not sure how reliable that line number is.
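
For context, here is a minimal, self-contained sketch of the kind of in-place bucket sort that loop belongs to. It is not the SWIFT source: `bucket_sort`, `check_counts`, the `placed` array and the plain `int` payload are all hypothetical stand-ins for illustration. It shows one way a scan like `while (ind[j] == target_cid)` could run past the end of a bucket: if the per-cell counts did not agree with the cell indices in `ind`, the scan would have no in-bounds slot to stop at. Whether that is what happened in this run is an assumption, not a diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 if counts[] is exactly the histogram of
 * ind[] (and every index is in range), -1 otherwise. */
static int check_counts(const int *ind, size_t n, const int *counts,
                        int num_bins) {
  int *hist = calloc((size_t)num_bins, sizeof(int));
  if (hist == NULL) return -1;
  int ok = 0;
  for (size_t k = 0; k < n && ok == 0; k++) {
    if (ind[k] < 0 || ind[k] >= num_bins) ok = -1;
    else hist[ind[k]]++;
  }
  for (int b = 0; b < num_bins && ok == 0; b++) {
    if (hist[b] != counts[b]) ok = -1;
  }
  free(hist);
  return ok;
}

/* Simplified stand-in for an in-place bucket sort of values[] by ind[].
 * Returns 0 on success, -1 if counts[] and ind[] disagree (nothing is
 * modified in that case) or if an allocation fails. */
static int bucket_sort(int *values, int *ind, size_t n, const int *counts,
                       int num_bins) {
  if (check_counts(ind, n, counts, num_bins) != 0) return -1;

  size_t *offsets = malloc(((size_t)num_bins + 1) * sizeof(size_t));
  size_t *placed = calloc((size_t)num_bins, sizeof(size_t));
  if (offsets == NULL || placed == NULL) {
    free(offsets);
    free(placed);
    return -1;
  }

  /* offsets[b] is the first slot of bucket b; offsets[num_bins] == n. */
  offsets[0] = 0;
  for (int b = 0; b < num_bins; b++)
    offsets[b + 1] = offsets[b] + (size_t)counts[b];

  for (int cid = 0; cid < num_bins; cid++) {
    for (size_t k = offsets[cid] + placed[cid]; k < offsets[cid + 1]; k++) {
      placed[cid]++;
      int target = ind[k];
      if (target == cid) continue; /* Already in the right bucket. */

      int temp = values[k];
      while (target != cid) {
        /* Find the next slot in the target bucket holding a misplaced
         * element. With inconsistent counts this scan could overrun. */
        size_t j = offsets[target] + placed[target]++;
        while (ind[j] == target) {
          j = offsets[target] + placed[target]++;
          assert(j < offsets[target + 1]); /* Guard against overrun. */
        }
        /* Drop the carried element there and pick up the displaced one. */
        int tv = values[j]; values[j] = temp; temp = tv;
        int ti = ind[j]; ind[j] = target; target = ti;
      }
      values[k] = temp;
      ind[k] = target;
    }
  }

  free(offsets);
  free(placed);
  return 0;
}
```

In this sketch, calling `bucket_sort` with counts that match `ind` sorts both arrays in place, while mismatched counts are rejected up front instead of letting the inner scan read out of bounds.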