Segfault in space_gparts_get_cell_index_mapper() in EAGLE-XL L0300N4512 (grid partitioned) DMONLY run
The large EAGLE-XL DMONLY run stopped with a segmentation fault at about z=11. The stderr file contains several instances of the following backtrace:
2 0x000000000006ba2c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
3 0x000000000006bf7c mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
4 0x0000000000036280 killpg() ??:0
5 0x00000000004cb6a4 space_gparts_get_cell_index_mapper() ??:0
6 0x00000000004dd864 threadpool_runner() ??:0
7 0x0000000000007dd5 start_thread() pthread_create.c:0
8 0x00000000000fdead __clone() ??:0
I think this means something went wrong in space_gparts_get_cell_index_mapper() (see the sketch after the log below for what that mapper does). The last few stdout lines before the crash look like this:
740 4.105694e-04 0.0814512 11.2772873 3.644596e-07 43 44 0 1366082 0 0 1161.603 0
[0000] [17041.9] engine_drift_top_multipoles: took 6.208 ms.
[0000] [17041.9] engine_repartition_trigger: took 0.001 ms
[0000] [17047.2] engine_unskip: took 5277.506 ms.
[0000] [17048.3] engine_prepare: Communicating rebuild flag took 1143.846 ms.
[0000] [17048.3] engine_prepare: took 6421.359 ms (including unskip, rebuild and reweight).
[0000] [17048.3] engine_print_task_counts: System total: 6940989949, no. cells: 1854143173
[0000] [17048.3] engine_print_task_counts: Total = 13727129 (per cell = 5.24)
[0000] [17048.3] engine_print_task_counts: Total = 13727129 (maximum per cell = 7.24)
[0000] [17048.5] engine_print_task_counts: task counts are [ none=0 sort=0 self=9459 pair=10087209 sub_self=0 sub_pair=0 init_grav=9025 init_grav_out=727233 ghost_in=0 ghost=0 ghost_out=0 extra_ghost=0 drift_part=0 drift_spart=0 drift_bpart=0 drift_gpart=9216 drift_gpart_out=1023162 end_hydro_force=0 kick1=9025 kick2=9025 timestep=9025 timestep_limiter=0 send=14639 recv=14579 grav_long_range=9025 grav_mm=224704 grav_down_in=727233 grav_down=9025 grav_mesh=9025 grav_end_force=9025 cooling=0 star_formation=0 star_formation_in=0 star_formation_out=0 logger=0 stars_in=0 stars_out=0 stars_ghost_in=0 stars_ghost=0 stars_ghost_out=0 stars_sort=0 stars_resort=0 bh_in=0 bh_out=0 bh_density_ghost=0 bh_swallow_ghost1=0 bh_swallow_ghost2=0 bh_swallow_ghost3=0 fof_self=0 fof_pair=0 skipped=817495 ]
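For context on where the crash sits: space_gparts_get_cell_index_mapper() runs under SWIFT's threadpool (hence the threadpool_runner() frame above it) and assigns each gravity particle to a top-level cell. Below is a minimal sketch of that shape; all names, struct layouts and geometry values are my assumptions rather than SWIFT's actual code, just to illustrate where a corrupt particle position or a bad pointer would fault:

/*
 * Sketch only, NOT SWIFT's actual code: illustrates where a bad
 * particle position would produce an out-of-range cell index.
 */
#include <stdlib.h>

struct gpart_sketch {
  double x[3]; /* particle position */
};

/* Flatten (i, j, k) into a 1D top-level cell index. */
static int cell_getid_sketch(const int cdim[3], int i, int j, int k) {
  return k + cdim[2] * (j + cdim[1] * i);
}

struct index_data_sketch {
  int cdim[3];      /* number of top-level cells per dimension */
  double iwidth[3]; /* inverse top-level cell width */
  int *ind;         /* output: cell index of each gpart */
  size_t offset;    /* offset of this chunk in the global arrays */
};

/* Called by the threadpool on a chunk of the gpart array. */
void gparts_get_cell_index_mapper_sketch(void *map_data, int nr_gparts,
                                         void *extra_data) {
  const struct gpart_sketch *gparts = (const struct gpart_sketch *)map_data;
  struct index_data_sketch *data = (struct index_data_sketch *)extra_data;

  for (int k = 0; k < nr_gparts; k++) {
    const double *x = gparts[k].x;

    /* If x[] is NaN or far outside the box (e.g. after a bad drift or
     * memory corruption), i/j/kk fall outside [0, cdim) and the
     * resulting index is garbage; any array indexed with it (here
     * data->ind, in SWIFT also the top-level cell array) can then be
     * accessed out of bounds, segfaulting inside the mapper. */
    const int i = (int)(x[0] * data->iwidth[0]);
    const int j = (int)(x[1] * data->iwidth[1]);
    const int kk = (int)(x[2] * data->iwidth[2]);

    data->ind[data->offset + k] = cell_getid_sketch(data->cdim, i, j, kk);
  }
}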
This was commit e4439b06 with the vla_ban and reverted_grav_depth_logic branches merged in.
SWIFT was configured with:
./configure CC=icc CFLAGS=-qopt-zmm-usage=high \
--enable-ipo \
--with-hdf5 \
--with-fftw=${FFTW3_ROOT} \
--with-parmetis=${PARMETIS_ROOT} \
--with-gsl=${GSL_ROOT} \
--with-tbbmalloc=${TBB_ROOT}/lib/intel64/gcc4.7
The .yml file is here: EAGLE-XL_L0300N4512.yml
This could be caused by running out of memory due to poor load balancing, but I wouldn't have expected problems this early: at the start of the run, each node was only using around half of its total memory.
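If memory pressure is the cause, one way it could surface as a segfault exactly here, rather than as a clean out-of-memory error, is an allocation for the per-particle index buffer failing around rebuild time with the resulting pointer going unchecked. This is a purely hypothetical sketch with made-up names, not SWIFT's actual code, and SWIFT may well already check its allocations:

/*
 * Hypothetical failure mode: the buffer holding per-gpart cell indices
 * is grown at rebuild time; if the allocation fails under memory
 * pressure and the result is unchecked, the mapper becomes the first
 * place to write through a bad pointer.
 */
#include <stdio.h>
#include <stdlib.h>

int *resize_index_buffer(int *ind, size_t nr_gparts) {
  int *new_ind = (int *)realloc(ind, nr_gparts * sizeof(int));
  if (new_ind == NULL) {
    /* Without this check the caller would hand NULL to the mapper,
     * which would then segfault with a backtrace like the one above. */
    fprintf(stderr, "failed to allocate index buffer for %zu gparts\n",
            nr_gparts);
    exit(EXIT_FAILURE);
  }
  return new_ind;
}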