Multi-time step MPI problems.
Tests of the following batch script:
#!/bin/bash -l
#
# Batch script for bash users
#
#BSUB -L /bin/bash
#BSUB -n 4
#BSUB -J SWIFT-mpi-test
#BSUB -oo job.dump
#BSUB -eo job.err
#BSUB -q bench1
#BSUB -P durham
#BSUB -R span[ptile=1]
#BSUB -x
#BSUB -W 00:30
NTHREADS=12
module purge
module load swift
module load swift/c4/intel/intelmpi/5.1.2
mpirun -np 4 ../swift_mpi -t $NTHREADS -f SodShock/sodShock.hdf5 -m 0.01 -w 5000 -c 1. -d 1e-7 -e 0.01
Runs of this script deadlock unless the affinity code is completely disabled (many threads appear to end up sharing the same cores). Once the affinity code is removed, the job runs, but negative time steps are occasionally seen in the third column, with the time in the second column stepping backwards, for instance at step 43:
42 0.005371 0.000122 1024127 490.381
43 0.004639 -0.000732 1024128 134.555
44 0.004883 0.000244 1024128 121.384
45 0.005127 0.000244 1024128 112.810
46 0.005371 0.000244 1024128 117.759
and the job eventually fails when h_max requires a regrid of the top-level cells. This never happened for the SodShock with fixed dt.
For reference, the SHA at this time was 33caebf8.
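
To spot the negative time steps without reading the whole log by eye, something like the following could be used. It is only a sketch: the function name find_negative_dt is made up for illustration, and it assumes the per-step lines have the layout shown above (step number in the first column, time step in the third) and that they end up in job.dump, the stdout file named in the batch script.

```shell
# Hypothetical helper, not part of SWIFT: print the step number and dt of
# every log line whose third column (assumed to be dt) is negative.
# Lines with fewer than three fields are skipped; a non-numeric third
# field evaluates to 0 in awk and is therefore never reported.
find_negative_dt() {
    awk 'NF >= 3 && $3 + 0 < 0 { print $1, $3 }' "$@"
}

# Example use, assuming the step lines are written to job.dump:
#   find_negative_dt job.dump
```

An empty result means no negative dt was found in the file. The function also reads from standard input when no file is given, so it can be placed at the end of a pipe.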