Segfault in the scheduler
tl;dr: it seems that we broke something in the tasking in commit ec3afbcc (Speedup the unskip and scheduler_start process).
I've come across a very weird bug with the Kelvin Helmholtz 2D test, though I don't believe it is specific to this test. The code segfaults pretty much immediately after reading the ICs. Full output:
Welcome to the cosmological hydrodynamical code
______ _________________
/ ___/ | / / _/ ___/_ __/
\__ \| | /| / // // /_ / /
___/ /| |/ |/ // // __/ / /
/____/ |__/|__/___/_/ /_/
SPH With Inter-dependent Fine-grained Tasking
Version : 0.8.4
Revision: v0.8.4-421-geb0b5de3-dirty, Branch: master, Date: 2019-09-26 21:12:46 +0100
Webpage : www.swiftsim.com
Config. options: '--with-hydro=anarchy-du --with-kernel=quintic-spline --with-hydro-dimension=2 --disable-mpi --disable-doxygen-doc --disable-hand-vec --enable-debug --disable-optimization --enable-debugging-checks'
Compiler: ICC, Version: 18.0.20180210
CFLAGS : '-g -O0 -idirafter /usr/include/linux -debug inline-debug-info -pthread -w2 -Wunused-variable -Wshadow -Werror -Wstrict-prototypes'
HDF5 library version: 1.8.20
FFTW library version: 3.x (details not available)
GSL library version: 1.15
[00000.0] main: CPU frequency used for tick conversion: 2194644151 Hz
[00000.0] main: Running on: login7b.pri.cosma7.alces.network
[00000.0] main: WARNING: Debugging checks activated. Code will be slower !
[00000.0] main: sizeof(part) is 160 bytes.
[00000.0] main: sizeof(xpart) is 64 bytes.
[00000.0] main: sizeof(spart) is 128 bytes.
[00000.0] main: sizeof(bpart) is 128 bytes.
[00000.0] main: sizeof(gpart) is 128 bytes.
[00000.0] main: sizeof(multipole) is 192 bytes.
[00000.0] main: sizeof(grav_tensor) is 168 bytes.
[00000.0] main: sizeof(task) is 96 bytes.
[00000.0] main: sizeof(cell) is 1312 bytes.
[00000.0] main: Reading runtime parameters from file 'kelvinHelmholtz.yml'
[00000.0] main: Internal unit system: U_M = 1.000000e+00 g.
[00000.0] main: Internal unit system: U_L = 1.000000e+00 cm.
[00000.0] main: Internal unit system: U_t = 1.000000e+00 s.
[00000.0] main: Internal unit system: U_I = 1.000000e+00 A.
[00000.0] main: Internal unit system: U_T = 1.000000e+00 K.
[00000.0] phys_const_print: Gravitational constant = 6.674080e-08
[00000.0] phys_const_print: Speed of light = 2.997925e+10
[00000.0] phys_const_print: Planck constant = 6.626070e-27
[00000.0] phys_const_print: Boltzmann constant = 1.380649e-16
[00000.0] phys_const_print: Thomson cross-section = 6.652459e-25
[00000.0] phys_const_print: Electron-Volt = 1.602177e-12
[00000.0] phys_const_print: Year = 3.155693e+07
[00000.0] phys_const_print: Astronomical Unit = 1.495979e+13
[00000.0] phys_const_print: Parsec = 3.085678e+18
[00000.0] phys_const_print: Solar mass = 1.988480e+33
[00000.0] phys_const_print: km/s/Mpc = 3.240779e-18
[00000.0] cooling_print_backend: Cooling function is 'No cooling'.
[00000.0] chemistry_print_backend: Chemistry function is 'No chemistry'.
[00000.0] main: Reading ICs from file './kelvinHelmholtz.hdf5'
[00000.0] io_read_unit_system: Reading IC units from ICs.
[00000.0] read_ic_single: IC and internal units match. No conversion needed.
[00001.5] main: Reading initial conditions took 1536.908 ms.
[00001.6] part_verify_links: All links OK
[00001.6] part_verify_links: took 103.689 ms.
[00001.6] main: Read 6290560 gas particles, 0 stars particles, 0 black hole particles, 0 DM particles and 0 DM background particles from the ICs.
[00002.1] space_regrid: (re)griding space cdim=(12 12 12)
[00002.1] main: space_init took 409.827 ms.
[00002.1] potential_print_backend: External potential is 'No external potential'.
[00002.1] main: space dimensions are [ 1.000 1.000 1.000 ].
[00002.1] main: space is periodic.
[00002.1] main: highest-level cell dimensions are [ 12 12 12 ].
[00002.1] main: 6290560 parts in 1728 cells.
[00002.1] main: 0 gparts in 1728 cells.
[00002.1] main: 0 sparts in 1728 cells.
[00002.1] main: 0 bparts in 1728 cells.
[00002.1] main: maximum depth is 0.
[00002.1] engine_config: Running simulation 'Untitled SWIFT simulation'.
[00002.1] engine_config: no processor affinity used
[00002.1] engine_policy: engine policies are [ 'steal' 'keep' 'numa affinity' 'hydro' ]
[00002.1] eos_print: Equation of state: Ideal gas.
[00002.1] eos_print: Adiabatic index gamma: 1.666667.
[00002.1] hydro_props_print: Hydrodynamic scheme: ANARCHY (Density-Energy) SPH (Borrow+ in prep) in 2D.
[00002.1] hydro_props_print: Hydrodynamic kernel: Quintic spline (M6) with eta=1.234800 (22.31 neighbours).
[00002.1] hydro_props_print: Hydrodynamic relative tolerance in h: 0.00010 (+/- 0.0045 neighbours).
[00002.1] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.1000.
[00002.1] hydro_props_print: Hydrodynamic integration: Max change of volume: 1.40 (max|dlog(h)/dt|=0.168236).
[00002.1] hydro_props_print: Neighbour number definition: Unweighted.
[00002.1] viscosity_print: Artificial viscosity parameters set to alpha: 0.100, max: 2.000, min: 0.000, length: 0.250.
[00002.1] diffusion_print: Artificial diffusion parameters set to alpha: 0.000, max: 1.000, min: 0.000, beta: 0.250.
[00002.1] entropy_floor_print: Entropy floor is 'no entropy floor'.
[00002.1] engine_config: Absolute minimal timestep size: 3.122502e-17
[00002.1] engine_config: Minimal timestep size (on time-line): 5.364418e-07
[00002.1] engine_config: Maximal timestep size (on time-line): 8.789062e-03
[00002.1] engine_config: Restarts will be dumped every 5.000000 hours
[New Thread 0x7fff9d35b700 (LWP 160539)]
[New Thread 0x7fff9cb5a700 (LWP 160540)]
[New Thread 0x7fff9c359700 (LWP 160541)]
[New Thread 0x7fff9bb58700 (LWP 160542)]
[New Thread 0x7fff9b357700 (LWP 160543)]
[New Thread 0x7fff9ab56700 (LWP 160544)]
[New Thread 0x7fff9a355700 (LWP 160545)]
[00002.1] main: engine_init took 13.960 ms.
[00002.1] main: Running on 6290560 gas particles, 0 stars particles 0 black hole particles and 0 DM particles (0 gravity particles)
[00002.1] main: from t=0.000e+00 until t=4.500e+00 with 1 ranks, 4 threads / rank and 4 task queues / rank (dt_min=1.000e-06, dt_max=1.000e-02)...
[00002.1] engine_init_particles: Setting particles to a valid state...
[00002.2] engine_init_particles: Computing initial gas densities.
[00002.2] space_rebuild: (re)building space
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff9c359700 (LWP 160541)]
0x00000000004923ab in cell_set_flag (c=0x0, flag=0) at cell.h:1352
1352 atomic_or(&c->flags, flag);
Missing separate debuginfos, use: debuginfo-install atlas-3.10.1-12.el7.x86_64 glibc-2.17-260.el7_6.5.x86_64 gsl-1.15-13.el7.x86_64 libgfortran-4.8.5-36.el7_6.2.x86_64 numactl-libs-2.0.9-7.el7.x86_64 zlib-1.2.7-18.el7.x86_64
The cell c is at address 0x0 which, to me, seems pretty bad...
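One way to narrow this down might be to catch the NULL pointer before the atomic update, so that the failure names the caller instead of segfaulting. The following is only a minimal sketch: `demo_cell` and `demo_cell_set_flag_checked` are hypothetical stand-ins (not SWIFT's actual struct or function), and C11 `atomic_fetch_or` is used in place of SWIFT's `atomic_or`.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for SWIFT's struct cell; only the flags field
 * matters for this illustration. */
struct demo_cell {
  _Atomic uint32_t flags;
};

/* Hypothetical checked variant of a flag setter.
 * Returns 0 on success, or -1 (after printing a diagnostic naming the
 * caller) if the cell pointer is NULL. */
static int demo_cell_set_flag_checked(struct demo_cell *c, uint32_t flag,
                                      const char *caller) {
  if (c == NULL) {
    fprintf(stderr, "%s: attempted to set flag %u on a NULL cell\n", caller,
            (unsigned)flag);
    return -1;
  }
  /* Atomically OR the flag into the cell's flags. */
  atomic_fetch_or(&c->flags, flag);
  return 0;
}
```

Passing `__func__` as the `caller` argument at each call site would show which task hands over the NULL cell; in gdb, `bt` on the crashing thread should give the same information without a rebuild.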
I have no idea how to go about debugging this, and it doesn't crash in the same way with the low-resolution ICs. Thankfully, it reproduces perfectly with a fresh clone of master:
- Commit eb0b5de3ee816b898f6c014d1b7e96d8faf207c1
- Modify the IC generation script (see below) so that it outputs a higher-resolution IC.
- Configure with: ./configure --with-hydro=anarchy-du --with-kernel=quintic-spline --with-hydro-dimension=2 --disable-mpi --disable-doxygen-doc --disable-hand-vec --enable-debug --disable-optimization --enable-debugging-checks
- Run with: gdb --args ../../swift --hydro --threads=4 kelvinHelmholtz.yml
- Code does not FPE or hit any debugging check before it gets to this point.
Modules:
Currently Loaded Modulefiles:
1) python/3.6.5 5) parallel_hdf5/1.8.20
2) ffmpeg/4.0.2 6) gsl/2.4(default)
3) intel_comp/2018(default) 7) fftw/3.3.7(default)
4) intel_mpi/2018 8) parmetis/4.0.3(default)
Patch to apply (this simply ups the resolution to 2048 x 2048):
diff --git a/examples/HydroTests/KelvinHelmholtz_2D/makeIC.py b/examples/HydroTe
index 9190669..30f398b 100644
--- a/examples/HydroTests/KelvinHelmholtz_2D/makeIC.py
+++ b/examples/HydroTests/KelvinHelmholtz_2D/makeIC.py
@@ -24,7 +24,7 @@ import sys
# Generates a swift IC file for the Kelvin-Helmholtz vortex in a periodic box
# Parameters
-L2 = 256 # Particles along one edge in the low-density region
+L2 = 2048 # Particles along one edge in the low-density region
gamma = 5./3. # Gas adiabatic index
P1 = 2.5 # Central region pressure
P2 = 2.5 # Outskirts pressure
After a little investigation, I see that this broke somewhere in the past 100 commits (between eb0b5de3 and fa3dab63). A little bit of bisection points to ec3afbcc as the commit that broke it:
ec3afbcce859e48c93f2e47af8900c364fbb6eec is the first bad commit
commit ec3afbcce859e48c93f2e47af8900c364fbb6eec
Author: Matthieu Schaller <schaller@strw.leidenuniv.nl>
Date: Sat Sep 14 11:17:51 2019 +0100
Speedup the unskip and scheduler_start process
:040000 040000 2452fa21948b552ee1f7343a46dff0f628d0a78f 990bb23462d42293e7bb7091bf31251067bf4552 M src
I'm guessing this is some task scheduling magic and that I should very much not touch it... But could someone who has a bit more tasking-fu than me check this out?