Buffered cell_split

Merged Pedro Gonnet requested to merge cell_split into master

This "works", but my tests crash because of #248 (closed), which I get every time I run EAGLE_12 with a single thread.

@pdraper, can you give this a spin through your regular tests to check for any hidden bugs or problems I've missed? Thanks!

Activity

  • Author Developer

    @jwillis, can you do a quick scaling test with this branch, just to see if cell_split scales better, i.e. just check if we're on the right track before the meeting tomorrow? Thanks!

  • Pedro Gonnet Title changed from [WIP] Buffered cell_split to Buffered cell_split

  • We are all busy conferencing here so I think we'll postpone the meeting.

  • Peter W. Draper Added 4 commits:

  • Peter W. Draper Added 1 commit:

  • Just tried this out myself and it all seemed to be working until I tried a build with optimization and no sanitizer, now it crashes all the time at:

    cell_split (c=c@entry=0x6ddca0, parts_offset=165033, buff=buff@entry=0x2aaac40b2110, 
        gbuff=gbuff@entry=0x0) at cell.c:526
    526               memswap(&buff[j], &temp_buff, sizeof(struct cell_buff));

    It's a bit tricky to diagnose, since it runs fine when I switch optimization off. This is on my desktop with GCC 5.4.

  • Author Developer

    Trying to reproduce, what does it crash with? SIGSEGV? And does this happen immediately, i.e. on the first call, or eventually?

  • Author Developer

    Currently can't reproduce, this is what configure says:

       Compiler        : gcc
        - vendor       : gnu
        - version      : 5.4.1
        - flags        : -g -fuse-ld=gold -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=native -mavx2 -pthread -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -Wall -Wextra -Wno-unused-parameter -Werror
       MPI enabled     : no
       HDF5 enabled    : yes
        - parallel     : yes
       Metis enabled   : yes
       FFTW3 enabled   : yes
       libNUMA enabled : yes
       Using tcmalloc  : yes
       Using jemalloc  : no
       CPU profiler    : yes
    
       Hydro scheme       : gadget2
       Dimensionality     : 3
       Kernel function    : cubic-spline
       Equation of state  : ideal-gas
       Adiabatic index    : 5/3
       Riemann solver     : none
       Cooling function   : none
       External potential : none
       Task debugging     : no
       Debugging checks   : no

    But I think I just might have fixed a subtle bug in memswap...
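For anyone following along without the source open, a chunked swap of the kind memswap performs can be sketched as below. This is only an illustration under assumptions, not the code in memswap.h: the names SWAP_LOOP and memswap_sketch are made up here, and it goes through memcpy rather than vector types so that it makes no alignment assumptions. The point is the loop shape: the remaining byte count is decremented directly in each pass, so the loop ends once fewer than sizeof(TYPE) bytes are left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap as many whole TYPE-sized chunks as fit, advancing both pointers and
 * decrementing the remaining byte count on every pass. memcpy is used for
 * the loads and stores, so nothing is assumed about the alignment of a or b.
 * (Hypothetical macro, written for this sketch only.) */
#define SWAP_LOOP(TYPE, a, b, count)  \
  while ((count) >= sizeof(TYPE)) {   \
    TYPE tmp_;                        \
    memcpy(&tmp_, (a), sizeof(TYPE)); \
    memcpy((a), (b), sizeof(TYPE));   \
    memcpy((b), &tmp_, sizeof(TYPE)); \
    (a) += sizeof(TYPE);              \
    (b) += sizeof(TYPE);              \
    (count) -= sizeof(TYPE);          \
  }

/* Swap `bytes` bytes between two non-overlapping buffers, largest chunks
 * first, then progressively smaller ones for the tail.
 * (Hypothetical name, not the function in memswap.h.) */
void memswap_sketch(void *void_a, void *void_b, size_t bytes) {
  assert(void_a != NULL && void_b != NULL);
  unsigned char *a = void_a;
  unsigned char *b = void_b;
  SWAP_LOOP(uint64_t, a, b, bytes);
  SWAP_LOOP(uint32_t, a, b, bytes);
  SWAP_LOOP(uint16_t, a, b, bytes);
  SWAP_LOOP(uint8_t, a, b, bytes);
}
```

Calling memswap_sketch(x, y, 13) on two 13-byte arrays would go through one 8-byte pass, one 4-byte pass and one 1-byte pass.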

  • Pedro Gonnet Added 3 commits:

    • 774acb3f - check part positions as doubles, not floats.
    • 9d65b220 - Merge branch 'cell_split' of gitlab.cosma.dur.ac.uk:swift/swiftsim into cell_split
    • d71008ff - wasn't directly decrementing loop variable, use better names in swap_loop macro.
  • Pedro Gonnet Added 1 commit:

    • 186fee40 - align cell_buff to 32 bytes, makes swapping more efficient at the cost of four more bytes.
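As a rough illustration of what aligning a buffer struct to 32 bytes involves in C11, here is a sketch. The field layout of cell_buff_sketch is an assumption made for this example and is not copied from cell.h; alloc_buff_sketch and is_aligned_32 are likewise hypothetical helpers. Note that aligning the type does not by itself align heap storage: malloc only guarantees alignment suitable for fundamental types (typically 16 bytes on x86-64 Linux), so the sketch allocates with aligned_alloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout, assumed for illustration only. */
struct cell_buff_sketch {
  _Alignas(32) double x[3]; /* forces 32-byte alignment of the whole struct */
  int ind;
};

/* 24 bytes of doubles plus a 4-byte int, padded up to the 32-byte alignment
 * (assumes 8-byte double and 4-byte int). */
_Static_assert(sizeof(struct cell_buff_sketch) == 32, "expected 32 bytes");
_Static_assert(_Alignof(struct cell_buff_sketch) == 32, "expected 32-byte alignment");

/* Allocate `count` elements on a 32-byte boundary. The requested size is
 * count * 32, a multiple of the alignment as C11 aligned_alloc requires.
 * The caller releases the memory with free(). */
struct cell_buff_sketch *alloc_buff_sketch(size_t count) {
  return aligned_alloc(_Alignof(struct cell_buff_sketch),
                       count * sizeof(struct cell_buff_sketch));
}

/* Non-zero if p sits on a 32-byte boundary. */
int is_aligned_32(const void *p) { return ((uintptr_t)p % 32u) == 0; }
```

With an allocation like this, every element of the array starts on a 32-byte boundary, because the element size is itself a multiple of 32.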
  • Author Developer

    I'm running it with

    [pedro@laika EAGLE_12]$ gdb --ex run --args ../swift -v 2 -t 2 -n 1000 -s eagle_12.yml
  • Pulled that fix and it is still crashing, with a SIGSEGV. It is early in the job, i.e. before the first step:

    gdb --ex run --args ../swift -v 2 -t 2 -n 1000 -s eagle_12.yml 
    
    GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
    Copyright (C) 2016 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from ../swift...done.
    Starting program: /loc/pwda/pdraper/scratch/swift-tests/swiftsim-cell_split/examples/swift -v 2 -t 2 -n 1000 -s eagle_12.yml
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
     Welcome to the cosmological hydrodynamical code
        ______       _________________
       / ___/ |     / /  _/ ___/_  __/
       \__ \| | /| / // // /_   / /   
      ___/ /| |/ |/ // // __/  / /    
     /____/ |__/|__/___/_/    /_/     
     SPH With Inter-dependent Fine-grained Tasking
    
     Version : 0.4.0
     Revision: v0.4.0-598-g186fee40, Branch: cell_split
     Webpage : www.swiftsim.com
    
     Config. options: '--with-metis --enable-debug'
    
     Compiler: GCC, Version: 5.4.0
     CFLAGS  : '-g -O0  -gdwarf -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=sandybridge -mavx -pthread -Wall -Wextra -Wno-unused-parameter -Werror'
    
     HDF5 library version: 1.8.16
    
    [00000.0] main: CPU frequency used for tick conversion: 3392294217 Hz
    [00000.0] main: Running on: starpc1
    [00000.0] main: sizeof(struct part)  is  128 bytes.
    [00000.0] main: sizeof(struct xpart) is   32 bytes.
    [00000.0] main: sizeof(struct gpart) is   96 bytes.
    [00000.0] main: sizeof(struct task)  is   64 bytes.
    [00000.0] main: sizeof(struct cell)  is  416 bytes.
    [00000.0] main: Reading runtime parameters from file 'eagle_12.yml'
    [00000.0] main: Internal unit system: U_M = 1.989000e+43 g.
    [00000.0] main: Internal unit system: U_L = 3.085678e+24 cm.
    [00000.0] main: Internal unit system: U_t = 3.085678e+19 s.
    [00000.0] main: Internal unit system: U_I = 1.000000e+00 A.
    [00000.0] main: Internal unit system: U_T = 1.000000e+00 K.
    [00000.0] phys_const_print:    Gravitational constant = 4.302051e+01
    [00000.0] phys_const_print:            Speed of light = 2.997925e+05
    [00000.0] phys_const_print:           Planck constant = 9.787529e-02
    [00000.0] phys_const_print:        Boltzmann constant = 6.941420e-70
    [00000.0] phys_const_print:     Thomson cross-section = 6.986843e-74
    [00000.0] phys_const_print:             Electron-Volt = 8.055187e-66
    [00000.0] phys_const_print:                      Year = 1.022690e-12
    [00000.0] phys_const_print:         Astronomical Unit = 4.848136e-12
    [00000.0] phys_const_print:                    Parsec = 9.999999e-07
    [00000.0] phys_const_print:                Solar mass = 9.997486e-11
    [00000.0] main: Reading ICs from file './EAGLE_ICs_12.hdf5'
    [00000.0] read_ic_single: IC and internal units match. No conversion needed.
    [00004.6] read_ic_single: Particle Type 4 not yet supported. Particles ignored
    [00004.6] read_ic_single: Particle Type 5 not yet supported. Particles ignored
    [00004.8] main: Reading initial conditions took 4792.568 ms.
    [00004.9] main: Read 6387423 gas particles and 0 gparts from the ICs.
    [00004.9] space_init: max_size set to 8000000, sub_size set to 64000000, split_size set to 400
    [00005.3] space_regrid: h_max is 3.480e-01 (cell_min=6.989e-01).
    [00005.3] space_regrid: set cell dimensions to [ 12 12 12 ].
    [00005.3] space_regrid: took 42.454 ms.
    [00005.3] main: space_init took 373.848 ms.
    [00005.3] main: space dimensions are [ 8.471 8.471 8.471 ].
    [00005.3] main: space is periodic.
    [00005.3] main: highest-level cell dimensions are [ 12 12 12 ].
    [00005.3] main: 6387423 parts in 1728 cells.
    [00005.3] main: 0 gparts in 1728 cells.
    [00005.3] main: maximum depth is 0.
    [00005.3] main: map_cellcheck picked up 0 parts.
    [00005.3] main: nr of cells at depth 0 is 1728.
    [00005.3] engine_init: no processor affinity used
    [00005.3] engine_policy: engine policies are [  steal  keep  numa_affinity  hydro  ]
    [New Thread 0x2aaae9539700 (LWP 21070)]
    [New Thread 0x2aaae973a700 (LWP 21071)]
    [New Thread 0x2aaae993b700 (LWP 21072)]
    [New Thread 0x2aaae9b3c700 (LWP 21073)]
    [00005.3] hydro_props_print: Equation of state: Ideal gas.
    [00005.3] hydro_props_print: Adiabatic index gamma: 1.666667.
    [00005.3] hydro_props_print: Hydrodynamic scheme: Gadget-2 version of SPH (Springel 2005) in 3D.
    [00005.3] hydro_props_print: Hydrodynamic kernel: Cubic spline (M4) with 48.00 +/- 0.10 neighbours (eta=1.234800).
    [00005.3] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.1000.
    [00005.3] hydro_props_print: Hydrodynamic integration: Max change of volume: 2.00 (max|dlog(h)/dt|=0.231049).
    [00005.3] engine_init: Absolute minimal timestep size: 3.725290e-11
    [00005.3] engine_init: Minimal timestep size (on time-line): 7.450580e-11
    [00005.3] engine_init: Maximal timestep size (on time-line): 7.812500e-05
    [00005.3] engine_compute_next_snapshot_time: Next output time set to t=9.999999e-04.
    [00005.3] engine_init: runner 0 using qid=0 no cpuid.
    [00005.3] engine_init: runner 1 using qid=1 no cpuid.
    [00005.3] main: engine_init took 1.550 ms.
    [00005.3] main: Running on 6387423 gas particles and 0 DM particles from t=0.000e+00 until t=1.000e-02 with 2 threads and 2 queues (dt_min=1.000e-10, dt_max=1.000e-04)...
    
    Thread 3 "swift" received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x2aaae973a700 (LWP 21071)]
    cell_split (c=c@entry=0xd05940, parts_offset=7424, buff=buff@entry=0x2aaaf006d130, gbuff=gbuff@entry=0x0) at cell.c:528
    528               memswap(&buff[j], &temp_buff, sizeof(struct cell_buff));
    (gdb) quit
    

    Here are my config options:

    
       Compiler        : mpicc
        - vendor       : gnu
        - version      : 5.4.0
        - flags        : -g -O0  -gdwarf -fvar-tracking-assignments -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=sandybridge -mavx -pthread -Wall -Wextra -Wno-unused-parameter -Werror
       MPI enabled     : yes
       HDF5 enabled    : yes
        - parallel     : no
       Metis enabled   : yes
       FFTW3 enabled   : no
       libNUMA enabled : yes
       Using tcmalloc  : no
       Using jemalloc  : no
       CPU profiler    : yes
    
       Hydro scheme       : gadget2
       Dimensionality     : 3
       Kernel function    : cubic-spline
       Equation of state  : ideal-gas
       Adiabatic index    : 5/3
       Riemann solver     : none
       Cooling function   : none
       External potential : none
       Task debugging     : no
       Debugging checks   : no

    Here's the back trace from gdb:

    #0  0x000000000045d73b in memswap (bytes=32, void_b=<optimised out>, void_a=0x2baad8074fb0, void_a@entry=0x2baad806d130) at memswap.h:66
    #1  cell_split (c=c@entry=0x1cddd40, parts_offset=7424, buff=buff@entry=0x2baad806d130, gbuff=gbuff@entry=0x0) at cell.c:528
    #2  0x0000000000409c8a in space_split_recursive (s=s@entry=0x7ffe9efc9540, c=c@entry=0x1cddd40, buff=<optimised out>, buff@entry=0x0, 
        gbuff=<optimised out>, gbuff@entry=0x0) at space.c:1549
    #3  0x000000000040a7b2 in space_split_mapper (map_data=0x1cddd40, num_cells=<optimised out>, extra_data=0x7ffe9efc9540) at space.c:1641
    #4  0x000000000047bd76 in threadpool_runner (data=0x7ffe9efc9710) at threadpool.c:76
    #5  0x00002baa951056ba in start_thread (arg=0x2baad3964700) at pthread_create.c:333
    #6  0x00002baa95dd282d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
    

    and the core now leads me to:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x000000000045d73b in memswap (bytes=32, void_b=<optimised out>, void_a=0x2baad8074fb0, void_a@entry=0x2baad806d130) at memswap.h:66
    66        swap_loop(__m256i, a, b, bytes);
    [Current thread is 1 (Thread 0x2baad3964700 (LWP 26491))]
    (gdb) info locals
    temp = <optimised out>
    a = 0x2baad8074fb0 "aW\351\035\177\016\322?\202_A\024\f\254\335?\030\034\277\305\071\216\034@\002"
    b = 0x2baad3963da0 "\021\375B\354\270\060\326?\373\177\202\373 \274\326?\020\220\r\322@\257\035@\003"
    (gdb) print bytes
    $1 = 32

    As I said, if I disable optimization the bug goes away, so a bug in this compiler can't be ruled out.
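One thing the addresses in the trace above allow checking is their alignment. The faulting line is swap_loop(__m256i, ...), which moves 32 bytes at a time, and an access through a __m256i pointer generally assumes a 32-byte-aligned address. The sketch below only does the arithmetic on the printed pointer values; misalignment_32 is a hypothetical helper, and nothing in this thread confirms which move instructions the compiler actually emitted.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of an address within its 32-byte block; 0 means 32-byte aligned.
 * (Hypothetical helper for inspecting pointer values copied from gdb.) */
unsigned misalignment_32(uint64_t addr) { return (unsigned)(addr % 32u); }
```

Applied to the values printed by gdb, misalignment_32(0x2baad8074fb0) (a) gives 16, misalignment_32(0x2baad806d130) (void_a@entry) gives 16, and misalignment_32(0x2baad3963da0) (b) gives 0. The distance between a and void_a@entry is 0x7e80 bytes, which is 1012 times 32.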

  • Author Developer

    Does this also happen when you use tcmalloc? Or if you compile with icc, or with a different gcc version on your machine?

    If it doesn't happen with the latter, I'd agree that it's most likely a compiler bug.

  • Author Developer

    Comparing void_a and void_a@entry, it looks like the termination criterion is somehow broken. Will check what the assembly looks like with godbolt!

    Edited by Pedro Gonnet
  • Author Developer

    Created a minimal example, and there doesn't seem to be a difference between gcc 5.4.0 and gcc 6.1.0 (the next-higher version available there): https://godbolt.org/g/UJeujv.

  • icc generates different assembly though.

    Could it be an aliasing issue somewhere? Like swapping something with itself?
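A cheap way to test the self-swap idea would be a debug check on the two ranges before swapping. The helper below is a sketch with a made-up name (ranges_overlap), not something from the SWIFT sources; it compares the addresses as integers and reports whether two ranges of the same length share any byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Non-zero if [a, a + bytes) and [b, b + bytes) share at least one byte.
 * Two empty ranges never overlap. (Hypothetical debugging helper.) */
int ranges_overlap(const void *a, const void *b, size_t bytes) {
  const uintptr_t ua = (uintptr_t)a;
  const uintptr_t ub = (uintptr_t)b;
  if (bytes == 0) return 0;
  return (ua < ub) ? (ub - ua < bytes) : (ua - ub < bytes);
}
```

An assert(!ranges_overlap(a, b, bytes)) at the top of a swap routine would then fire on any self-swap or partially overlapping swap in a debug build.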
