EAGLE-50 fails with 'PARMETIS ERROR: sum weight for constraint 0 is zero'
Yesterday I submitted a grid of EAGLE 50 Mpc boxes running with full physics, using MPI with ParMETIS. They reached their 24-hour run time limit and were resubmitted automatically. One of them failed on the first time step after the restart. The last few lines of stdout look like this:
[0013] [00022.4] engine_compute_next_fof_time: Next FoF time set to a=1.463327e-01.
[0000] [00022.4] main: engine_config took 1582.840 ms.
# Step Time Scale-factor Redshift Time-step Time-bins Updates g-Updates s-Updates b-Updates Wall-clock time [ms] Props
63166 9.898754e-04 0.1460576 5.8466137 6.856289e-09 36 47 424678394 850518004 574189 6413 0.000 0
PARMETIS ERROR: sum weight for constraint 0 is zero.
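The wall-clock time for this time step is reported as zero. My guess is that this is why ParMETIS complains about zero weights: if the repartitioning weights are derived from measured timings, a step with no recorded time could leave every weight at zero.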
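To make the guess concrete, here is a minimal sketch in C of the kind of check that would catch this before the partitioner is called. It is not SWIFT's code and I have not checked how SWIFT builds its weights. The function names are hypothetical, and the only thing taken from ParMETIS is the vertex weight layout (ncon interleaved values per vertex). Falling back to uniform weights is just one possible choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of SWIFT or ParMETIS.
 * Sums the weights of one constraint. Weights are interleaved per vertex:
 * the value for vertex i and constraint con sits at vwgt[i * ncon + con]. */
static int64_t constraint_weight_sum(const int64_t *vwgt, size_t nvtxs,
                                     size_t ncon, size_t con) {
  int64_t sum = 0;
  for (size_t i = 0; i < nvtxs; i++) sum += vwgt[i * ncon + con];
  return sum;
}

/* Hypothetical guard, assuming non-negative weights.
 * For every constraint whose weights sum to zero (e.g. because all measured
 * timings were zero), overwrite that constraint with uniform weights of 1.
 * Returns the number of constraints that had to be repaired. */
static size_t repair_zero_constraints(int64_t *vwgt, size_t nvtxs,
                                      size_t ncon) {
  if (nvtxs == 0) return 0; /* Nothing to repair on an empty graph. */
  size_t repaired = 0;
  for (size_t c = 0; c < ncon; c++) {
    if (constraint_weight_sum(vwgt, nvtxs, ncon, c) != 0) continue;
    for (size_t i = 0; i < nvtxs; i++) vwgt[i * ncon + c] = 1;
    repaired++;
  }
  return repaired;
}
```

With a check like this, an all-zero weight array would be replaced by a uniform one (or the repartition skipped for that step) instead of reaching the partitioner and aborting the run.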
I'm running commit 77dc3b54 from master and starting SWIFT like this:
ccc_mprun ${codedir}/swiftsim/examples/swift_mpi --verbose=0 \
--param=Restarts:resubmit_on_exit:1 \
--param=Restarts:resubmit_command:${codedir}/irene/resub.sh \
--param=Restarts:max_run_time:23.0 \
--param=Snapshots:output_list:${codedir}/output_times.txt \
--param=InitialConditions:file_name:/ccc/store/cont005/ra4707/hellyjoh/EAGLE_ICs/SwiftICs/EAGLE_L0050N0752_ICs.hdf5 \
--param=EAGLECooling:dir_name:${codedir}/Data/coolingtables/ \
--param=EAGLEFeedback:filename:${codedir}/Data/yieldtables/ \
--pin --cosmology ${eagle_flags} \
--threads=24 eagle_50.yml
Here's the stdout file: swift.3857086.1.out. Stderr only contains messages saying that the ParMETIS function call failed, e.g.:
[0000] [00049.9] partition.c:pick_parmetis():1036: Call to ParMETIS_V3_AdaptiveRepart failed.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
In: PMI_Abort(-1, application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3)
The input parameter file was eagle_50.yml.