Segfault in space_gparts_get_cell_index_mapper() in EAGLE-XL L0300N4512 (grid partitioned) DMONLY run
The large EAGLE-XL DMONLY run stopped with a segmentation fault at about z=11. The stderr file contains several instances of the following backtrace:
2 0x000000000006ba2c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
3 0x000000000006bf7c mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
4 0x0000000000036280 killpg() ??:0
5 0x00000000004cb6a4 space_gparts_get_cell_index_mapper() ??:0
6 0x00000000004dd864 threadpool_runner() ??:0
7 0x0000000000007dd5 start_thread() pthread_create.c:0
8 0x00000000000fdead __clone() ??:0
I think this means something went wrong in space_gparts_get_cell_index_mapper() (see the sketch after the log below for what that mapper does). The last few stdout lines before the crash look like this:
740 4.105694e-04 0.0814512 11.2772873 3.644596e-07 43 44 0 1366082 0 0 1161.603 0
[0000] [17041.9] engine_drift_top_multipoles: took 6.208 ms.
[0000] [17041.9] engine_repartition_trigger: took 0.001 ms
[0000] [17047.2] engine_unskip: took 5277.506 ms.
[0000] [17048.3] engine_prepare: Communicating rebuild flag took 1143.846 ms.
[0000] [17048.3] engine_prepare: took 6421.359 ms (including unskip, rebuild and reweight).
[0000] [17048.3] engine_print_task_counts: System total: 6940989949, no. cells: 1854143173
[0000] [17048.3] engine_print_task_counts: Total = 13727129 (per cell = 5.24)
[0000] [17048.3] engine_print_task_counts: Total = 13727129 (maximum per cell = 7.24)
[0000] [17048.5] engine_print_task_counts: task counts are [ none=0 sort=0 self=9459 pair=10087209 sub_self=0 sub_pair=0 init_grav=9025 init_grav_out=727233 ghost_in=0 ghost=0 ghost_out=0 extra_ghost=0 drift_part=0 drift_spart=0 drift_bpart=0 drift_gpart=9216 drift_gpart_out=1023162 end_hydro_force=0 kick1=9025 kick2=9025 timestep=9025 timestep_limiter=0 send=14639 recv=14579 grav_long_range=9025 grav_mm=224704 grav_down_in=727233 grav_down=9025 grav_mesh=9025 grav_end_force=9025 cooling=0 star_formation=0 star_formation_in=0 star_formation_out=0 logger=0 stars_in=0 stars_out=0 stars_ghost_in=0 stars_ghost=0 stars_ghost_out=0 stars_sort=0 stars_resort=0 bh_in=0 bh_out=0 bh_density_ghost=0 bh_swallow_ghost1=0 bh_swallow_ghost2=0 bh_swallow_ghost3=0 fof_self=0 fof_pair=0 skipped=817495 ]
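For context on where the crash sits: space_gparts_get_cell_index_mapper() runs under SWIFT's threadpool (hence the threadpool_runner() frame above it) and assigns each gravity particle to a top-level cell. Below is a minimal sketch of that shape; all names, struct layouts and geometry values are my assumptions rather than SWIFT's actual code, just to illustrate where a corrupt particle position or a bad pointer would fault:

/*
 * Sketch only, NOT SWIFT's actual code: illustrates where a bad
 * particle position would produce an out-of-range cell index.
 */
#include <stdlib.h>

struct gpart_sketch {
  double x[3]; /* particle position */
};

/* Flatten (i, j, k) into a 1D top-level cell index. */
static int cell_getid_sketch(const int cdim[3], int i, int j, int k) {
  return k + cdim[2] * (j + cdim[1] * i);
}

struct index_data_sketch {
  int cdim[3];      /* number of top-level cells per dimension */
  double iwidth[3]; /* inverse top-level cell width */
  int *ind;         /* output: cell index of each gpart */
  size_t offset;    /* offset of this chunk in the global arrays */
};

/* Called by the threadpool on a chunk of the gpart array. */
void gparts_get_cell_index_mapper_sketch(void *map_data, int nr_gparts,
                                         void *extra_data) {
  const struct gpart_sketch *gparts = (const struct gpart_sketch *)map_data;
  struct index_data_sketch *data = (struct index_data_sketch *)extra_data;

  for (int k = 0; k < nr_gparts; k++) {
    const double *x = gparts[k].x;

    /* If x[] is NaN or far outside the box (e.g. after a bad drift or
     * memory corruption), i/j/kk fall outside [0, cdim) and the
     * resulting index is garbage; any array indexed with it (here
     * data->ind, in SWIFT also the top-level cell array) can then be
     * accessed out of bounds, segfaulting inside the mapper. */
    const int i = (int)(x[0] * data->iwidth[0]);
    const int j = (int)(x[1] * data->iwidth[1]);
    const int kk = (int)(x[2] * data->iwidth[2]);

    data->ind[data->offset + k] = cell_getid_sketch(data->cdim, i, j, kk);
  }
}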
This was commit e4439b06 with the vla_ban and reverted_grav_depth_logic branches merged in.
SWIFT was configured with:
./configure CC=icc CFLAGS=-qopt-zmm-usage=high \
--enable-ipo \
--with-hdf5 \
--with-fftw=${FFTW3_ROOT} \
--with-parmetis=${PARMETIS_ROOT} \
--with-gsl=${GSL_ROOT} \
--with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7
The .yml file is here: EAGLE-XL_L0300N4512.yml
This could be caused by running out of memory due to poor load balancing, but I wouldn't have expected problems this early: at the start of the run, each node was only using around half of its total memory.
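If memory pressure is the cause, one way it could surface as a segfault exactly here, rather than as a clean out-of-memory error, is an allocation for the per-particle index buffer failing around rebuild time with the resulting pointer going unchecked. This is a purely hypothetical sketch with made-up names, not SWIFT's actual code, and SWIFT may well already check its allocations:

/*
 * Hypothetical failure mode: the buffer holding per-gpart cell indices
 * is grown at rebuild time; if the allocation fails under memory
 * pressure and the result is unchecked, the mapper becomes the first
 * place to write through a bad pointer.
 */
#include <stdio.h>
#include <stdlib.h>

int *resize_index_buffer(int *ind, size_t nr_gparts) {
  int *new_ind = (int *)realloc(ind, nr_gparts * sizeof(int));
  if (new_ind == NULL) {
    /* Without this check the caller would hand NULL to the mapper,
     * which would then segfault with a backtrace like the one above. */
    fprintf(stderr, "failed to allocate index buffer for %zu gparts\n",
            nr_gparts);
    exit(EXIT_FAILURE);
  }
  return new_ind;
}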