EAGLE_50 self gravity crashes
While testing some other work I came across a test case that fails when self-gravity is enabled on the EAGLE_50 volume. The same test worked until recently.
Here is the submission script:
#!/bin/bash -l
#
# Batch script for bash users
#
#BSUB -L /bin/bash
#BSUB -n 6
#BSUB -J SWIFT-mpi-test
#BSUB -oo job.%J.dump
#BSUB -eo job.%J.err
#BSUB -q cosma5
#BSUB -P dp004
#BSUB -R span[ptile=1]
#BSUB -x
#BSUB -W 1:00
module purge
module load swift
module load swift/c5/intel/intelmpi/2017-parallel
module load parmetis
mpirun -np 6 ../swift_mpi -a -G -S -t 16 -s -v 1 \
-PDomainDecomposition:trigger:500 \
-PScheduler:max_top_level_cells:40 \
-n 5000 eagle_50.yml
exit
The run dumps core shortly after step 0:
# Step Time Scale-factor Time-step Time-bins Updates g-Updates s-Updates Wall-clock time [ms] Props
0 0.000000e+00 1.000000e+00 0.000000e+00 1 56 404421250 850466735 20786477 275926.000 3
[0000] [00591.8] space_regrid: h_max is 4.176e-01 (cell_min=8.387e-01).
[0000] [00593.0] space_regrid: took 1245.531 ms.
[0000] [00593.7] space_parts_get_cell_index: took 650.545 ms.
[0000] [00594.6] space_gparts_get_cell_index: took 896.638 ms.
[0000] [00594.6] space_sparts_get_cell_index: took 22.852 ms.
[0000] [00595.1] engine_exchange_strays: sent out 0/0/0 parts/gparts/sparts, got 0/0/0 back.
[0000] [00595.1] engine_exchange_strays: took 123.335 ms.
[0000] [00595.4] space_parts_sort: took 343.727 ms.
[0000] [00595.5] space_sparts_sort: took 28.012 ms.
[0000] [00598.1] space_gparts_sort: took 939.512 ms.
[0000] [00612.5] space_split: took 9135.604 ms.
[0000] [00612.5] space_rebuild: took 20691.884 ms.
[0000] [00615.7] engine_exchange_cells: took 3214.453 ms.
[0000] [00618.9] engine_exchange_proxy_multipoles: took 2742.721 ms.
[0000] [00618.9] engine_estimate_nr_tasks: tasks per cell estimated as: 6, maximum tasks: 29644980
[0000] [00622.1] scheduler_reweight: took 147.380 ms.
[0000] [00622.1] engine_maketasks: took 3193.015 ms (including reweight).
[0000] [00622.2] space_list_cells_with_tasks: Have 16740 local cells (total=64000)
[0000] [00622.2] engine_marktasks: took 6.011 ms.
[0000] [00622.2] engine_print_task_counts: Total = 1494189 (per cell = 1)
[0000] [00622.2] engine_print_task_counts: task counts are [ none=0 sort=1 self=1 pair=112 sub_self=2 sub_pair=36 init_grav=1 ghost_in=1 ghost=1 ghost_out=1 extra_ghost=0 drift_part=1 drift_gpart=1 end_force=1 kick1=1 kick2=1 timestep=1 send=0 recv=0 grav_top_level=1 grav_long_range=1 grav_ghost_in=1331 grav_ghost_out=1331 grav_mm=0 grav_down=1 cooling=0 sourceterms=0 skipped=1491362 ]
[0000] [00622.2] engine_print_task_counts: nr_parts = 90379552.
[0000] [00622.2] engine_print_task_counts: nr_gparts = 192065966.
[0000] [00622.2] engine_print_task_counts: nr_sparts = 5611531.
[0000] [00622.2] engine_print_task_counts: took 11.007 ms.
[0000] [00622.2] engine_rebuild: took 30423.971 ms.
[0000] [00622.2] engine_unskip: took 3.050 ms.
[0000] [00622.2] engine_prepare: took 30427.039 ms (including unskip and reweight).
[0000] [00622.2] engine_drift_top_multipoles: took 1.495 ms.
[0000] [00622.2] engine_print_task_counts: Total = 1494189 (per cell = 1)
[0000] [00622.2] engine_print_task_counts: task counts are [ none=0 sort=1 self=1 pair=112 sub_self=2 sub_pair=36 init_grav=1 ghost_in=1 ghost=1 ghost_out=1 extra_ghost=0 drift_part=1 drift_gpart=1 end_force=1 kick1=1 kick2=1 timestep=1 send=0 recv=0 grav_top_level=1 grav_long_range=1 grav_ghost_in=1331 grav_ghost_out=1331 grav_mm=0 grav_down=1 cooling=0 sourceterms=0 skipped=1491362 ]
[0000] [00622.2] engine_print_task_counts: nr_parts = 90379552.
[0000] [00622.2] engine_print_task_counts: nr_gparts = 192065966.
[0000] [00622.2] engine_print_task_counts: nr_sparts = 5611531.
[0000] [00622.2] engine_print_task_counts: took 10.866 ms.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 7207 RUNNING AT m5415
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
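For reference, an exit code above 128 conventionally means the process was killed by a signal, with the signal number being the exit code minus 128. A minimal sketch of decoding the value reported above, assuming Linux signal numbering (the variable names are only illustrative):

```shell
# Decode an exit status: values above 128 mean "killed by signal (status - 128)".
status=139   # value taken from the EXIT CODE line above
if [ "$status" -gt 128 ]; then
    signum=$((status - 128))
    # kill -l <number> prints the signal name without the SIG prefix
    signame=$(kill -l "$signum")
    echo "exit status $status = signal $signum (SIG$signame)"
fi
```

On Linux signal 11 is SIGSEGV, so this is a segmentation fault rather than an MPI-level abort.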
This was run using the module swift/c5/intel/intelmpi/2017-parallel and the configuration options: ./configure --with-metis --enable-debug
The core files are truncated, so they may not be useful, but they point to an issue in:
#0 runner_do_grav_fft () at runner_doiact_fft.c:93
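If the truncation is caused by the core-file size limit (an assumption; a disk quota or the sheer size of the process image could equally be responsible), raising the limit in the submission script before the mpirun line may help. A hedged sketch:

```shell
# Lift the core-file size limit for this shell and its children, then confirm it.
# The fallback message covers the case where the hard limit forbids raising it.
ulimit -c unlimited 2>/dev/null || echo "could not raise core limit (hard limit is lower)"
core_limit=$(ulimit -c)
echo "core file size limit: $core_limit"
```

Whether the limit reaches ranks started on other nodes depends on the MPI launcher and the batch system, so it is worth checking the printed value on each node.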
I tried running under DDT, and that aborted with a "failed to converge" error. I will try with debugging checks enabled next and see what that says.