Crash in engine_exchange_strays()
Hi all,
I'm running into a problem running swift_mpi on the OzSTAR cluster (https://supercomputing.swin.edu.au/ozstar/). I am running a pure N-body cosmological simulation of a 40 Mpc/h box with 128^3 particles (progressively scaling up the particle number, using existing runs for comparison). The run is on 4 processors and it completes about 20-30 timesteps before it hangs and then drops out with the error message:
[0000] [00435.7] engine.c:engine_exchange_strays():1348: Do not have a proxy for the requested nodeID 1 for part with id=1035059, x=[2.962571e+01,1.048334e+01,2.338945e+01].
I've had a look at the source code and at the issues on GitLab, but it's not (yet) obvious to me what the cause is.
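As a quick sanity check, here is a small Python sketch (my own, not taken from the SWIFT source) that works out which top-level cell the offending particle sits in, using only the box size, the top-level cell count, and the position reported in the output and error message. The helper names are hypothetical, and I am assuming that top-level cells are equal-width and indexed from the origin.

```python
import math

# Values copied from the log output quoted in this message.
box_size = 59.250      # "space dimensions are [ 59.250 59.250 59.250 ]"
cells_per_dim = 8      # "highest-level cell dimensions are [ 8 8 8 ]"
position = (2.962571e+01, 1.048334e+01, 2.338945e+01)  # from the error message

# Assumption: top-level cells are equal-width and start at the origin.
cell_width = box_size / cells_per_dim


def top_level_cell(pos):
    """Return the (i, j, k) index of the top-level cell containing pos."""
    return tuple(int(math.floor(x / cell_width)) % cells_per_dim for x in pos)


def distance_to_nearest_face(x):
    """Distance from coordinate x to the nearest cell face along one axis."""
    offset = x % cell_width
    return min(offset, cell_width - offset)


cell = top_level_cell(position)
face_distances = tuple(distance_to_nearest_face(x) for x in position)

print("cell width:", cell_width)
print("top-level cell index:", cell)
print("distance to nearest cell face per axis:", face_distances)
```

With these numbers the particle lands in cell (4, 1, 3) and is only about 7e-4 length units past the cell face at x = 29.625, which is also the mid-plane of the box. Given the initial grid of [ 2 1 2 ], I would guess that this plane is a boundary between ranks, so the particle has probably just drifted across it, but I have not checked how SWIFT actually assigns cells to ranks.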
Has anyone encountered this problem before? The output, Slurm script, and .yml parameter file I used are below.
Cheers,
Chris
>
> --------
> [0000] [00000.0] main: MPI is up and running with 4 node(s).
>
> Welcome to the cosmological hydrodynamical code
> ______ _________________
> / ___/ | / / _/ ___/_ __/
> \__ \| | /| / // // /_ / /
> ___/ /| |/ |/ // // __/ / /
> /____/ |__/|__/___/_/ /_/
> SPH With Inter-dependent Fine-grained Tasking
>
> Version : 0.8.0
> Revision: v0.8.0-9-gbc49a531-dirty, Branch: master, Date: 2018-11-19 17:23:37 +0100
> Webpage : www.swiftsim.com
>
> Config. options: '--prefix=/home/cpower/Codes --enable-mpi --enable-parallel-hdf5 --disable-compiler-warnings --with-tbbmalloc --disable-vec'
>
> Compiler: GCC, Version: 6.4.0
> CFLAGS : '-O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=skylake-avx512 -mavx512dq -fno-tree-vectorize -pthread'
>
> HDF5 library version: 1.10.1
> FFTW library version: 3.x (details not available)
> GSL library version: 2.4
> MPI library: Open MPI v3.0.0 (MPI std v3.1)
>
> [0000] [00000.0] main: CPU frequency used for tick conversion: 2294417265 Hz
> [0000] [00000.0] main: sizeof(part) is 128 bytes.
> [0000] [00000.0] main: sizeof(xpart) is 64 bytes.
> [0000] [00000.0] main: sizeof(spart) is 96 bytes.
> [0000] [00000.0] main: sizeof(gpart) is 64 bytes.
> [0000] [00000.0] main: sizeof(multipole) is 176 bytes.
> [0000] [00000.0] main: sizeof(grav_tensor) is 144 bytes.
> [0000] [00000.0] main: sizeof(task) is 64 bytes.
> [0000] [00000.0] main: sizeof(cell) is 896 bytes.
> [0000] [00000.0] main: Reading runtime parameters from file './l40_n512.yml'
> [0000] [00000.0] main: Using METIS serial partitioning:
> [0000] [00000.0] main: initial partitioning: axis aligned grids of cells
> [0000] [00000.0] main: grid set to [ 2 1 2 ].
> [0000] [00000.0] main: repartitioning: none
> [0000] [00000.0] main: Internal unit system: U_M = 1.988480e+43 g.
> [0000] [00000.0] main: Internal unit system: U_L = 3.085678e+24 cm.
> [0000] [00000.0] main: Internal unit system: U_t = 3.085678e+19 s.
> [0000] [00000.0] main: Internal unit system: U_I = 1.000000e+00 A.
> [0000] [00000.0] main: Internal unit system: U_T = 1.000000e+00 K.
> [0000] [00000.0] phys_const_print: Gravitational constant = 4.300927e+01
> [0000] [00000.0] phys_const_print: Speed of light = 2.997925e+05
> [0000] [00000.0] phys_const_print: Planck constant = 1.079902e-99
> [0000] [00000.0] phys_const_print: Boltzmann constant = 6.943236e-70
> [0000] [00000.0] phys_const_print: Thomson cross-section = 6.986845e-74
> [0000] [00000.0] phys_const_print: Electron-Volt = 8.057293e-66
> [0000] [00000.0] phys_const_print: Year = 1.022690e-12
> [0000] [00000.0] phys_const_print: Astronomical Unit = 4.848137e-12
> [0000] [00000.0] phys_const_print: Parsec = 1.000000e-06
> [0000] [00000.0] phys_const_print: Solar mass = 1.000000e-10
> [0000] [00000.4] cosmology_print: Density parameters: [O_m, O_l, O_b, O_k, O_r] = [0.312100, 0.687900, 0.049100, 0.000000, 0.000000]
> [0000] [00000.4] cosmology_print: Dark energy equation of state: w_0=-1.000000 w_a=0.000000
> [0000] [00000.4] cosmology_print: Hubble constant: h = 0.675100, H_0 = 6.751000e+01 U_t^(-1)
> [0000] [00000.4] cosmology_print: Hubble time: 1/H0 = 1.481262e-02 U_t
> [0000] [00000.4] cosmology_print: Universe age at present day: 1.412332e-02 U_t
> [0000] [00000.4] main: Reading ICs from file './ICs/L40_N512_z99_cdm_sigma8_0.815.swft.hdf5'
> [0000] [00000.4] main: Cleaning up h-factors (h=0.675100)
> [0000] [00000.4] main: Cleaning up a-factors from velocity (a=0.010000)
> [0000] [00000.5] io_read_unit_system: 'Units' group not found in ICs. Assuming internal unit system.
> [0000] [00000.5] read_ic_serial: IC and internal units match. No conversion needed.
> [0000] [00002.1] main: Reading initial conditions took 1759.242 ms.
> [0000] [00002.1] main: Read 0 gas particles, 0 stars particles and 2097152 gparts from the ICs.
> [0000] [00002.1] main: space_init took 4.487 ms.
> [0000] [00002.1] main: space dimensions are [ 59.250 59.250 59.250 ].
> [0000] [00002.1] main: space is periodic.
> [0000] [00002.1] main: highest-level cell dimensions are [ 8 8 8 ].
> [0000] [00002.1] main: 0 parts in 512 cells.
> [0000] [00002.1] main: 524288 gparts in 512 cells.
> [0000] [00002.1] main: 0 sparts in 512 cells.
> [0000] [00002.1] main: maximum depth is 0.
> [0000] [00002.1] potential_print_backend: External potential is 'No external potential'.
> [0000] [00002.1] cooling_print_backend: Cooling function is 'No cooling'.
> [0000] [00002.1] chemistry_print_backend: Chemistry function is 'No chemistry'.
> [0000] [00002.1] engine_config: prefer NUMA-distant CPUs
> [0000] [00002.1] engine_init: cpu map is [ 9 ].
> [0000] [00002.2] engine_policy: engine policies are [ 'steal' 'keep' 'mpi' 'numa affinity' 'self gravity' 'cosmological integration' ]
> [0000] [00002.2] gravity_props_print: Self-gravity scheme: Default (no potential)
> [0000] [00002.2] gravity_props_print: Self-gravity scheme: FMM-MM with m-poles of order 4
> [0000] [00002.2] gravity_props_print: Self-gravity time integration: eta=0.0250
> [0000] [00002.2] gravity_props_print: Self-gravity opening angle: theta=0.3000
> [0000] [00002.2] gravity_props_print: Self-gravity softening functional form: Wendland-C2
> [0000] [00002.2] gravity_props_print: Self-gravity comoving softening: epsilon=0.0060 (Plummer equivalent: 0.0020)
> [0000] [00002.2] gravity_props_print: Self-gravity maximal physical softening: epsilon=0.0060 (Plummer equivalent: 0.0020)
> [0000] [00002.2] gravity_props_print: Self-gravity mesh side-length: N=512
> [0000] [00002.2] gravity_props_print: Self-gravity mesh smoothing-scale: a_smooth=1.250000
> [0000] [00002.2] gravity_props_print: Self-gravity tree cut-off ratio: r_cut_max=4.500000
> [0000] [00002.2] gravity_props_print: Self-gravity truncation cut-off ratio: r_cut_min=0.100000
> [0000] [00002.2] gravity_props_print: Self-gravity mesh truncation function: Gadget-like (using erfc())
> [0000] [00002.2] gravity_props_print: Self-gravity tree update frequency: f=0.010000
> [0000] [00002.2] engine_config: Absolute minimal timestep size: 3.195479e-17
> [0000] [00002.2] engine_config: Minimal timestep size (on time-line): 8.609401e-07
> [0000] [00002.2] engine_config: Maximal timestep size (on time-line): 7.052822e-03
> [0000] [00002.2] engine_config: Restarts will be dumped every 6.000000 hours
> [0000] [00002.2] main: engine_init took 33.041 ms.
> [0000] [00002.2] main: Running on 0 gas particles, 0 stars particles and 2097152 DM particles (2097152 gravity particles)
> [0000] [00002.2] main: from t=1.768e-05 until t=1.412e-02 with 4 ranks, 1 threads / rank and 1 task queues / rank (dt_min=1.000e-06, dt_max=1.000e-02)...
> [0000] [00002.2] engine_init_particles: Setting particles to a valid state...
> [0000] [00002.2] engine_init_particles: Computing initial gas densities.
> [0000] [00013.1] engine_init_particles: Converting internal energy variable.
> [0000] [00013.1] engine_init_particles: Running initial fake time-step.
> # Step Time Scale-factor Redshift Time-step Time-bins Updates g-Updates s-Updates Wall-clock time [ms] Props
> 0 1.767639e-05 0.0100000 99.0000000 0.000000e+00 1 56 0 2097152 0 18798.932 27
> 1 1.791649e-05 0.0100904 98.1045856 2.401017e-07 47 47 0 2097152 0 16715.176 1
> 2 1.815985e-05 0.0101815 97.2171889 2.433611e-07 47 48 0 2097152 0 16298.353 1
> 3 1.840652e-05 0.0102735 96.3377381 2.466683e-07 47 47 0 2097152 0 16344.778 1
> 4 1.865654e-05 0.0103663 95.4661620 2.500174e-07 47 49 0 2097152 0 16404.680 1
> 5 1.890995e-05 0.0104600 94.6023901 2.534146e-07 47 47 0 2097152 0 16651.715 1
> 6 1.916681e-05 0.0105545 93.7463526 2.568559e-07 47 48 0 2097152 0 16225.421 1
> 7 1.942715e-05 0.0106499 92.8979801 2.603453e-07 47 47 0 2097152 0 16430.977 1
> 8 1.969104e-05 0.0107461 92.0572041 2.638813e-07 47 50 0 2097152 0 16496.738 1
> 9 1.995850e-05 0.0108432 91.2239565 2.674656e-07 47 47 0 2097152 0 16713.441 1
> 10 2.022960e-05 0.0109411 90.3981699 2.710989e-07 47 48 0 2097152 0 16343.641 1
> 11 2.050438e-05 0.0110400 89.5797776 2.747807e-07 47 47 0 2097152 0 16477.797 1
> 12 2.078290e-05 0.0111397 88.7687132 2.785140e-07 47 49 0 2097152 0 16384.424 1
> 13 2.106519e-05 0.0112404 87.9649113 2.822958e-07 47 47 0 2097152 0 16743.326 1
> 14 2.135132e-05 0.0113419 87.1683067 2.861318e-07 47 48 0 2097152 0 16564.635 1
> 15 2.164134e-05 0.0114444 86.3788350 2.900164e-07 47 47 0 2097152 0 16564.104 1
> 16 2.193530e-05 0.0115478 85.5964323 2.939580e-07 47 51 0 2097152 0 16416.809 1
> 17 2.223325e-05 0.0116522 84.8210354 2.979486e-07 47 47 0 2097152 0 16801.801 1
> 18 2.253524e-05 0.0117574 84.0525815 3.019979e-07 47 48 0 2097152 0 16518.475 1
> 19 2.284134e-05 0.0118637 83.2910085 3.060980e-07 47 47 0 2097152 0 16528.070 1
> 20 2.315160e-05 0.0119709 82.5362547 3.102574e-07 47 49 0 2097152 0 16503.588 1
> 21 2.346607e-05 0.0120790 81.7882591 3.144703e-07 47 47 0 2097152 0 16892.643 1
> 22 2.378481e-05 0.0121881 81.0469611 3.187428e-07 47 48 0 2097152 0 16795.998 1
> 23 2.410788e-05 0.0122983 80.3123008 3.230716e-07 47 47 0 2097152 0 16568.832 1
> 24 2.443534e-05 0.0124094 79.5842188 3.274602e-07 47 50 0 2097152 0 16654.135 1
> 25 2.476725e-05 0.0125215 78.8626561 3.319082e-07 47 47 0 2097152 0 16858.889 1
> [0000] [00435.7] engine.c:engine_exchange_strays():1348: Do not have a proxy for the requested nodeID 1 for part with id=1035059, x=[2.962571e+01,1.048334e+01,2.338945e+01].
>
> slurm script:
>
> #!/bin/bash
> #SBATCH --account=oz009
> #SBATCH --ntasks=4
> #SBATCH --time=4:00:00
> #SBATCH --mem-per-cpu=4000
>
> module load gcc/6.4.0
> module load openmpi/3.0.0
> module load gsl/2.4
> module load hdf5/1.10.1
> #module load fftw/2.1.5
> module load fftw/3.3.7
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HOME}/Codes/lib
> export OMPI_MCA_io=ompio
>
> #srun ./G3NR ./L10_N128_NR.param
> #srun ./G3CoolingSinks ./L10_N128_CoolingSinks_Tmin1e1.param
> #srun ./G3CoolingSinks ./L10_N128_CoolingSinks.param
> #srun ./G3CoolingSinksHeatingCUBA L10_N128_CoolingSinks_Heating.param
> #srun ./G3CoolingSinksSNe ./L10_N128_CoolingSinks_Tmin1e1_SNe.param
> srun ./swift_mpi -acG ./l40_n512.yml
>
> ----
>
> l40_n512.yml:
>
> # Define the system of units to use internally.
> InternalUnitSystem:
> UnitMass_in_cgs: 1.98848e43 # 10^10 M_sun
> UnitLength_in_cgs: 3.08567758e24 # 1 Mpc
> UnitVelocity_in_cgs: 1e5 # 1 km/s
> UnitCurrent_in_cgs: 1 # Amperes
> UnitTemp_in_cgs: 1 # Kelvin
>
> # Structure finding options
> #StructureFinding:
> # config_file_name: stf_input_6dfof_dmonly_sub.cfg
> # basename: ./stf
> # output_time_format: 1
> # scale_factor_first: 0.2
> # delta_time: 1.02
>
> Cosmology: # WMAP9 cosmology
> Omega_m: 0.3121
> Omega_lambda: 0.6879
> Omega_b: 0.0491
> h: 0.6751
> a_begin: 0.01 # z_ini = 99.
> a_end: 1.0 # z_end = 0.
>
> # Parameters governing the time integration
> TimeIntegration:
> dt_min: 1e-6
> dt_max: 1e-2
>
> # Parameters for the self-gravity scheme
> Gravity:
> eta: 0.025
> theta: 0.3
> comoving_softening: 0.002 # 1/30th of the mean inter-particle separation: 2 kpc
> max_physical_softening: 0.002 # 1/30th of the mean inter-particle separation: 2 kpc
> mesh_side_length: 512
>
> # Parameters governing the snapshots
> Snapshots:
> basename: snap
> delta_time: 1.02
> scale_factor_first: 0.1
>
> # Parameters governing the conserved quantities statistics
> Statistics:
> delta_time: 1.02
> scale_factor_first: 0.1
>
> Scheduler:
> max_top_level_cells: 8
> cell_split_size: 50
>
> # Parameters related to the initial conditions
> InitialConditions:
> file_name: ./ICs/L40_N512_z99_cdm_sigma8_0.815.swft.hdf5
> periodic: 1
> cleanup_h_factors: 1
> cleanup_velocity_factors: 1