Buffered cell_split
This "works", but my tests crash because of #248 (closed), which I get every time I run EAGLE_12 with a single thread.
@pdraper, can you give this a spin through your regular tests to see if there are no hidden bugs/problems I've missed? Thanks!
Is the crash on the Sedov blast with --enable-debugging-checks? If so, the problem is unrelated to this branch. Master crashes as well. It's a drift problem, fixed in !292 (merged).
@jwillis, can you do a quick scaling test with this branch, just to see if cell_split scales better, i.e. just check if we're on the right track before the meeting tomorrow? Thanks!
Added 4 commits:
- 1a500878...9e4ad79c - 3 commits from branch master
- ed881319 - Merge branch 'master' into cell_split
Added 1 commit:
- d8228584 - Fix documentation
Just tried this out myself and it all seemed to be working until I tried a build with optimization and no sanitizer; now it crashes every time at:
cell_split (c=c@entry=0x6ddca0, parts_offset=165033, buff=buff@entry=0x2aaac40b2110, gbuff=gbuff@entry=0x0) at cell.c:526
526 memswap(&buff[j], &temp_buff, sizeof(struct cell_buff));
It's a bit tricky to diagnose, as it runs when I switch optimization off. This is on my desktop with GCC 5.4.
Currently can't reproduce; this is what configure says:
Compiler          : gcc
 - vendor         : gnu
 - version        : 5.4.1
 - flags          : -g -fuse-ld=gold -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=native -mavx2 -pthread -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -Wall -Wextra -Wno-unused-parameter -Werror
MPI enabled       : no
HDF5 enabled      : yes
 - parallel       : yes
Metis enabled     : yes
FFTW3 enabled     : yes
libNUMA enabled   : yes
Using tcmalloc    : yes
Using jemalloc    : no
CPU profiler      : yes
Hydro scheme      : gadget2
Dimensionality    : 3
Kernel function   : cubic-spline
Equation of state : ideal-gas
Adiabatic index   : 5/3
Riemann solver    : none
Cooling function  : none
External potential : none
Task debugging    : no
Debugging checks  : no
But I think I just might have fixed a subtle bug in memswap...
Added 1 commit:
- 186fee40 - align cell_buff to 32 bytes, makes swapping more efficient at the cost of four more bytes.
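For anyone following along, here is a minimal sketch of what aligning a buffer element to 32 bytes can look like with GCC. The struct name, its fields, and the helper functions are hypothetical stand-ins, since the real `struct cell_buff` is not shown in this thread. The sketch also allocates the array with C11 `aligned_alloc`, on the assumption that a type attribute alone does not control where heap memory starts.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cell_buff; the real fields are not shown
 * in this thread. The attribute raises the alignment to 32 bytes, which also
 * pads the 28 bytes of payload up to a size of 32. */
struct cell_buff_sketch {
  double x[3];
  int ind;
} __attribute__((aligned(32)));

/* Compile-time checks that the layout is what a 32-byte swap expects. */
_Static_assert(sizeof(struct cell_buff_sketch) == 32, "unexpected size");
_Static_assert(alignof(struct cell_buff_sketch) == 32, "unexpected alignment");

/* Allocate count elements starting on a 32-byte boundary.
 * Returns NULL on failure; release with free(). */
static struct cell_buff_sketch *cell_buff_sketch_alloc(size_t count) {
  return aligned_alloc(alignof(struct cell_buff_sketch),
                       count * sizeof(struct cell_buff_sketch));
}

/* Non-zero if p sits on a 32-byte boundary. */
static int is_aligned_32(const void *p) { return ((uintptr_t)p % 32) == 0; }
```

With an allocation like this, every element of the array lands on a 32-byte boundary, because the element size is itself a multiple of 32.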
Pulled that fix and it is still crashing, with a SIGSEGV. It is early in the job, i.e. before the first step:
gdb --ex run --args ../swift -v 2 -t 2 -n 1000 -s eagle_12.yml
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../swift...done.
Starting program: /loc/pwda/pdraper/scratch/swift-tests/swiftsim-cell_split/examples/swift -v 2 -t 2 -n 1000 -s eagle_12.yml
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
 Welcome to the cosmological hydrodynamical code
    ______       _________________
   / ___/ |     / /  _/ ___/_  __/
   \__ \| | /| / // // /_   / /
  ___/ /| |/ |/ // // __/  / /
 /____/ |__/|__/___/_/    /_/
 SPH With Inter-dependent Fine-grained Tasking
 Version : 0.4.0
 Revision: v0.4.0-598-g186fee40, Branch: cell_split
 Webpage : www.swiftsim.com
 Config. options: '--with-metis --enable-debug'
 Compiler: GCC, Version: 5.4.0
 CFLAGS : '-g -O0 -gdwarf -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=sandybridge -mavx -pthread -Wall -Wextra -Wno-unused-parameter -Werror'
 HDF5 library version: 1.8.16
[00000.0] main: CPU frequency used for tick conversion: 3392294217 Hz
[00000.0] main: Running on: starpc1
[00000.0] main: sizeof(struct part) is 128 bytes.
[00000.0] main: sizeof(struct xpart) is 32 bytes.
[00000.0] main: sizeof(struct gpart) is 96 bytes.
[00000.0] main: sizeof(struct task) is 64 bytes.
[00000.0] main: sizeof(struct cell) is 416 bytes.
[00000.0] main: Reading runtime parameters from file 'eagle_12.yml'
[00000.0] main: Internal unit system: U_M = 1.989000e+43 g.
[00000.0] main: Internal unit system: U_L = 3.085678e+24 cm.
[00000.0] main: Internal unit system: U_t = 3.085678e+19 s.
[00000.0] main: Internal unit system: U_I = 1.000000e+00 A.
[00000.0] main: Internal unit system: U_T = 1.000000e+00 K.
[00000.0] phys_const_print: Gravitational constant = 4.302051e+01
[00000.0] phys_const_print: Speed of light = 2.997925e+05
[00000.0] phys_const_print: Planck constant = 9.787529e-02
[00000.0] phys_const_print: Boltzmann constant = 6.941420e-70
[00000.0] phys_const_print: Thomson cross-section = 6.986843e-74
[00000.0] phys_const_print: Electron-Volt = 8.055187e-66
[00000.0] phys_const_print: Year = 1.022690e-12
[00000.0] phys_const_print: Astronomical Unit = 4.848136e-12
[00000.0] phys_const_print: Parsec = 9.999999e-07
[00000.0] phys_const_print: Solar mass = 9.997486e-11
[00000.0] main: Reading ICs from file './EAGLE_ICs_12.hdf5'
[00000.0] read_ic_single: IC and internal units match. No conversion needed.
[00004.6] read_ic_single: Particle Type 4 not yet supported. Particles ignored
[00004.6] read_ic_single: Particle Type 5 not yet supported. Particles ignored
[00004.8] main: Reading initial conditions took 4792.568 ms.
[00004.9] main: Read 6387423 gas particles and 0 gparts from the ICs.
[00004.9] space_init: max_size set to 8000000, sub_size set to 64000000, split_size set to 400
[00005.3] space_regrid: h_max is 3.480e-01 (cell_min=6.989e-01).
[00005.3] space_regrid: set cell dimensions to [ 12 12 12 ].
[00005.3] space_regrid: took 42.454 ms.
[00005.3] main: space_init took 373.848 ms.
[00005.3] main: space dimensions are [ 8.471 8.471 8.471 ].
[00005.3] main: space is periodic.
[00005.3] main: highest-level cell dimensions are [ 12 12 12 ].
[00005.3] main: 6387423 parts in 1728 cells.
[00005.3] main: 0 gparts in 1728 cells.
[00005.3] main: maximum depth is 0.
[00005.3] main: map_cellcheck picked up 0 parts.
[00005.3] main: nr of cells at depth 0 is 1728.
[00005.3] engine_init: no processor affinity used
[00005.3] engine_policy: engine policies are [ steal keep numa_affinity hydro ]
[New Thread 0x2aaae9539700 (LWP 21070)]
[New Thread 0x2aaae973a700 (LWP 21071)]
[New Thread 0x2aaae993b700 (LWP 21072)]
[New Thread 0x2aaae9b3c700 (LWP 21073)]
[00005.3] hydro_props_print: Equation of state: Ideal gas.
[00005.3] hydro_props_print: Adiabatic index gamma: 1.666667.
[00005.3] hydro_props_print: Hydrodynamic scheme: Gadget-2 version of SPH (Springel 2005) in 3D.
[00005.3] hydro_props_print: Hydrodynamic kernel: Cubic spline (M4) with 48.00 +/- 0.10 neighbours (eta=1.234800).
[00005.3] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.1000.
[00005.3] hydro_props_print: Hydrodynamic integration: Max change of volume: 2.00 (max|dlog(h)/dt|=0.231049).
[00005.3] engine_init: Absolute minimal timestep size: 3.725290e-11
[00005.3] engine_init: Minimal timestep size (on time-line): 7.450580e-11
[00005.3] engine_init: Maximal timestep size (on time-line): 7.812500e-05
[00005.3] engine_compute_next_snapshot_time: Next output time set to t=9.999999e-04.
[00005.3] engine_init: runner 0 using qid=0 no cpuid.
[00005.3] engine_init: runner 1 using qid=1 no cpuid.
[00005.3] main: engine_init took 1.550 ms.
[00005.3] main: Running on 6387423 gas particles and 0 DM particles from t=0.000e+00 until t=1.000e-02 with 2 threads and 2 queues (dt_min=1.000e-10, dt_max=1.000e-04)...
Thread 3 "swift" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaae973a700 (LWP 21071)]
cell_split (c=c@entry=0xd05940, parts_offset=7424, buff=buff@entry=0x2aaaf006d130, gbuff=gbuff@entry=0x0) at cell.c:528
528 memswap(&buff[j], &temp_buff, sizeof(struct cell_buff));
(gdb) quit
Here are my config options:
Compiler          : mpicc
 - vendor         : gnu
 - version        : 5.4.0
 - flags          : -g -O0 -gdwarf -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=sandybridge -mavx -pthread -Wall -Wextra -Wno-unused-parameter -Werror
MPI enabled       : yes
HDF5 enabled      : yes
 - parallel       : no
Metis enabled     : yes
FFTW3 enabled     : no
libNUMA enabled   : yes
Using tcmalloc    : no
Using jemalloc    : no
CPU profiler      : yes
Hydro scheme      : gadget2
Dimensionality    : 3
Kernel function   : cubic-spline
Equation of state : ideal-gas
Adiabatic index   : 5/3
Riemann solver    : none
Cooling function  : none
External potential : none
Task debugging    : no
Debugging checks  : no
Here's the backtrace from gdb:
#0 0x000000000045d73b in memswap (bytes=32, void_b=<optimised out>, void_a=0x2baad8074fb0, void_a@entry=0x2baad806d130) at memswap.h:66
#1 cell_split (c=c@entry=0x1cddd40, parts_offset=7424, buff=buff@entry=0x2baad806d130, gbuff=gbuff@entry=0x0) at cell.c:528
#2 0x0000000000409c8a in space_split_recursive (s=s@entry=0x7ffe9efc9540, c=c@entry=0x1cddd40, buff=<optimised out>, buff@entry=0x0, gbuff=<optimised out>, gbuff@entry=0x0) at space.c:1549
#3 0x000000000040a7b2 in space_split_mapper (map_data=0x1cddd40, num_cells=<optimised out>, extra_data=0x7ffe9efc9540) at space.c:1641
#4 0x000000000047bd76 in threadpool_runner (data=0x7ffe9efc9710) at threadpool.c:76
#5 0x00002baa951056ba in start_thread (arg=0x2baad3964700) at pthread_create.c:333
#6 0x00002baa95dd282d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
and the core now leads me to:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000045d73b in memswap (bytes=32, void_b=<optimised out>, void_a=0x2baad8074fb0, void_a@entry=0x2baad806d130) at memswap.h:66
66 swap_loop(__m256i, a, b, bytes);
[Current thread is 1 (Thread 0x2baad3964700 (LWP 26491))]
(gdb) info locals
temp = <optimised out>
a = 0x2baad8074fb0 "aW\351\035\177\016\322?\202_A\024\f\254\335?\030\034\277\305\071\216\034@\002"
b = 0x2baad3963da0 "\021\375B\354\270\060\326?\373\177\202\373 \274\326?\020\220\r\322@\257\035@\003"
(gdb) print bytes
$1 = 32
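A quick bit of arithmetic on the addresses printed above, in case it helps narrow things down. This is only a sketch with hypothetical helper names; it works on the raw numbers from the gdb output and makes no claim about what actually triggers the SIGSEGV.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of an address modulo a power-of-two boundary
 * (0 means the address sits exactly on the boundary). */
static uint64_t misalignment(uint64_t addr, uint64_t boundary) {
  return addr & (boundary - 1);
}

/* Index of an array element, given the base address of the array,
 * the address of the element, and the element size in bytes. */
static uint64_t element_index(uint64_t base, uint64_t elem, uint64_t size) {
  return (elem - base) / size;
}
```

Plugging in the values from the core, `misalignment(0x2baad8074fb0, 32)` is 16 for `a`, `misalignment(0x2baad3963da0, 32)` is 0 for `b`, and `element_index(0x2baad806d130, 0x2baad8074fb0, 32)` is 1012. So `a` is 1012 32-byte elements past `void_a@entry` and sits 16 bytes off a 32-byte boundary, while `b` is on one. Whether that matters here depends on which load and store instructions the compiler emitted for the `__m256i` swap, which this thread has not established.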
As I said, if I disable optimization the bug goes away, so you cannot rule out a bug in this compiler.
Comparing void_a and void_a@entry, it looks like the termination criterion is somehow broken. Will check what the assembly looks like with godbolt!
Edited by Pedro Gonnet
Created a minimal example, and there doesn't seem to be a difference between gcc 5.4.0 and gcc 6.1.0 (the next-higher version available there): https://godbolt.org/g/UJeujv.
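As a reference point when comparing against the optimized build, here is a minimal sketch of a chunked swap written with plain `memcpy`. It is not the code in memswap.h; `memswap_sketch` is a hypothetical name, and the 32-byte chunk is chosen only to mirror the `__m256i` width seen in the trace. The loop terminates by counting down the remaining bytes, and it makes no assumption about pointer alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference swap, not the memswap.h implementation.
 * Swaps `bytes` bytes between two non-overlapping regions. */
static void memswap_sketch(void *void_a, void *void_b, size_t bytes) {
  unsigned char *a = void_a;
  unsigned char *b = void_b;
  unsigned char temp[32]; /* chunk size mirrors the 256-bit vector width */

  /* Swap whole chunks while at least one full chunk remains. */
  while (bytes >= sizeof(temp)) {
    memcpy(temp, a, sizeof(temp));
    memcpy(a, b, sizeof(temp));
    memcpy(b, temp, sizeof(temp));
    a += sizeof(temp);
    b += sizeof(temp);
    bytes -= sizeof(temp);
  }

  /* Swap the tail that is shorter than one chunk, if any. */
  if (bytes > 0) {
    memcpy(temp, a, bytes);
    memcpy(a, b, bytes);
    memcpy(b, temp, bytes);
  }
}
```

Swapping the same data with this and with the optimized routine, then comparing the results, is one way to tell whether the termination logic or something else is at fault.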