Small runs crash during initial partitioning with ParMETIS
It looks like small runs, with up to ~1-10 million particles, crash consistently shortly after startup when run in MPI mode with SWIFT compiled with ParMETIS. This is with the latest master. There is no problem when the same ICs are run without MPI, or with a non-ParMETIS build of the code, or when the same code is used to run a version of the ICs at 10x higher resolution.
The last lines of output are
[0000] [00001.7] main: Running on 1331000 gas particles, 0 sink particles, 0 stars particles 0 black hole particles, 0 neutrino particles, and 1331000 DM particles (2662000 gravity particles)
[0000] [00001.7] main: from t=1.220e-05 until t=1.410e-02 with 1 ranks, 28 threads / rank and 28 task queues / rank (dt_min=1.000e-10, dt_max=1.000e-02)...
and then
/var/slurm/slurmd/job4588668/slurm_script: line 23: 153857 Segmentation fault ./swift_mpi -v 1 --pin --threads=28 --cosmology --colibre --dust params.yml
(This example is from a run with the COLIBRE branch, but the same issue appears on the main branch. Sometimes there is no error message at all, just a crash.)
From a quick look with DDT, the segfault happens inside partition.c, at line 468 (cut is larger than the size of ptrs).
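For illustration, here is a minimal, hypothetical C sketch of that failure mode. It is not the SWIFT code: only the names cut and ptrs are borrowed from the debugger report, and the bisection over a small weights array is an assumption about the kind of index computation involved. The point is that, for very small inputs, such an index can end up one past the end of the array it is then used to address, which would match a segfault that only shows up at low particle counts.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Return the first index at which the running sum of `weights` reaches
 * `target`. For very small inputs the total weight can stay below `target`,
 * in which case this returns n, i.e. one past the end of the array. */
static size_t bisect_cut(const double *weights, size_t n, double target) {
  double sum = 0.0;
  size_t cut = 0;
  while (cut < n && sum < target) sum += weights[cut++];
  return cut; /* may equal n for degenerate (small) inputs */
}

int main(void) {
  const size_t n = 4; /* a "small run": only a handful of graph vertices */
  const double weights[4] = {1.0, 1.0, 1.0, 1.0};
  int *ptrs = calloc(n, sizeof *ptrs); /* same length as weights */
  if (ptrs == NULL) return EXIT_FAILURE;

  /* With a target larger than the total weight, the cut lands at n. */
  const size_t cut = bisect_cut(weights, n, 100.0);

  /* A bounds check like this would catch the problem before the access;
   * without it, ptrs[cut] with cut == n touches memory past the end of the
   * allocation and can segfault. */
  assert(cut < n && "cut must be a valid index into ptrs");
  printf("ptrs[%zu] = %d\n", cut, ptrs[cut]);

  free(ptrs);
  return EXIT_SUCCESS;
}
```

If something of that sort is indeed what happens at partition.c:468, a bounds check (or a clamp of cut to the last valid index) just before the access should at least turn the silent segfault into a clear error message.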
Code setup:
Version : 0.9.0
Revision: v0.9.0-799-gdee0a623, Branch: master, Date: 2021-12-30 15:53:34 +0100
Webpage : www.swiftsim.com
Config. options: '--with-tbbmalloc --enable-ipo --with-hydro=sphenix --with-subgrid=EAGLE --with-kernel=wendland-C2 --with-parmetis'
Compiler: ICC, Version: 20.21.20201112
CFLAGS : '-ip -ipo -O3 -ansi-alias -xCORE-AVX512 -pthread -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -w2 -Wunused-variable -Wshadow -Werror -Wstrict-prototypes'
HDF5 library version : 1.10.6
FFTW library version : 3.x (details not available)
GSL library version : 2.5
MPI library version : Intel(R) MPI Library 2018 Update 2 for Linux* OS (MPI std v3.1)
ParMETIS library version : 4.0.3
A simulation that reproduces the issue within a few seconds of run time is on Cosma at /cosma7/data/dp004/dc-bahe1/BEEHIVE/PARENT_BOXES/IDZZ_EagleL100_3e10 (initial conditions at /cosma7/data/dp004/dc-bahe1/BEEHIVE/PARENT_BOXES/ICs/EagleL100_3e10.hdf5). The higher-resolution simulation that runs fine is at /cosma7/data/dp004/dc-bahe1/BEEHIVE/PARENT_BOXES/IDWW_EagleL100_3e9.
It does not look like a show-stopper per se, since it appears to only affect runs that can easily fit on a single node anyway. But it would probably be good to get to the bottom of it nonetheless.