Cosmo Volume stops after two outputs
When running the test:
mpirun -np 10 ./test_mindt_mpi -c 0.01 -t 1 -f CosmoVolume/cosmoVolume.hdf5 -m 0.6 -w 5000 -d 0.01 -g "1 5 2"
as a job distributed across 10 nodes of COSMA, the job outputs two snapshots and then fails to generate any further output. At least one node is slightly busy; the others are in wait states of various kinds. Sampling the stack of the busy node three times, we get the following:
> pstack-full 13602
Thread 3 (Thread 0x2ad6edefd700 (LWP 13603)):
#0 0x00000037256dc053 in poll () from /lib64/libc.so.6
#1 0x00002ad6ea0945c1 in dapl_select (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:102
#2 cm_thread (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:2265
#3 0x00002ad6ea0ead3d in dapli_thread_init (thread_draft=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/udapl/linux/dapl_osd.c:593
#4 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad6eea95700 (LWP 13618)):
#0 0x00000000004122d0 in timers_toc (t=14, tic=85256083164102309) at timers.h:70
#1 0x0000000000415782 in runner_dopair_subset_density (r=0xf95b60, ci=0x136a280, parts_i=0x4a06280, ind=0x2ad6eea8d1a0, count=1, cj=0x1360680) at runner_doiact.h:510
#2 0x0000000000450911 in runner_doghost (r=0xf95b60, c=0x136a280) at runner.c:705
#3 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x136c080) at runner.c:561
#4 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x1563480) at runner.c:561
#5 0x00000000004540b9 in runner_main (data=0xf95b60) at runner.c:1317
#6 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#7 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad6ec226660 (LWP 13602)):
#0 0x0000003725a0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000045f2b7 in engine_launch (e=0x7fff0766a750, nr_runners=1, mask=32702) at engine.c:1679
#2 0x000000000045f6f1 in engine_step (e=0x7fff0766a750) at engine.c:1761
#3 0x000000000040b5f4 in main (argc=15, argv=0x7fff0766ac88) at test.c:846
> pstack-full 13602
Thread 3 (Thread 0x2ad6edefd700 (LWP 13603)):
#0 0x00000037256dc053 in poll () from /lib64/libc.so.6
#1 0x00002ad6ea0945c1 in dapl_select (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:102
#2 cm_thread (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:2265
#3 0x00002ad6ea0ead3d in dapli_thread_init (thread_draft=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/udapl/linux/dapl_osd.c:593
#4 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad6eea95700 (LWP 13618)):
#0 runner_dopair_subset_density (r=0xf95b60, ci=0x136a280, parts_i=0x4a06280, ind=0x2ad6eea8d1a0, count=1, cj=0x1589480) at runner_doiact.h:452
#1 0x0000000000450965 in runner_doghost (r=0xf95b60, c=0x136a280) at runner.c:707
#2 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x136c080) at runner.c:561
#3 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x1563480) at runner.c:561
#4 0x00000000004540b9 in runner_main (data=0xf95b60) at runner.c:1317
#5 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#6 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad6ec226660 (LWP 13602)):
#0 0x0000003725a0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000045f2b7 in engine_launch (e=0x7fff0766a750, nr_runners=1, mask=32702) at engine.c:1679
#2 0x000000000045f6f1 in engine_step (e=0x7fff0766a750) at engine.c:1761
#3 0x000000000040b5f4 in main (argc=15, argv=0x7fff0766ac88) at test.c:846
> pstack-full 13602
Thread 3 (Thread 0x2ad6edefd700 (LWP 13603)):
#0 0x00000037256dc053 in poll () from /lib64/libc.so.6
#1 0x00002ad6ea0945c1 in dapl_select (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:102
#2 cm_thread (arg=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/openib_ucm/cm.c:2265
#3 0x00002ad6ea0ead3d in dapli_thread_init (thread_draft=0x2ad6f00008c4) at ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/udapl/linux/dapl_osd.c:593
#4 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#5 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad6eea95700 (LWP 13618)):
#0 0x00000000004164c1 in runner_doself_subset_density (r=0xf95b60, ci=0x136a280, parts=0x4a06280, ind=0x2ad6eea8d1a0, count=1) at runner_doiact.h:688
#1 0x000000000045089a in runner_doghost (r=0xf95b60, c=0x136a280) at runner.c:698
#2 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x136c080) at runner.c:561
#3 0x000000000044fe83 in runner_doghost (r=0xf95b60, c=0x1563480) at runner.c:561
#4 0x00000000004540b9 in runner_main (data=0xf95b60) at runner.c:1317
#5 0x0000003725a077f1 in start_thread () from /lib64/libpthread.so.0
#6 0x00000037256e570d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad6ec226660 (LWP 13602)):
#0 0x0000003725a0b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000045f2b7 in engine_launch (e=0x7fff0766a750, nr_runners=1, mask=32702) at engine.c:1679
#2 0x000000000045f6f1 in engine_step (e=0x7fff0766a750) at engine.c:1761
#3 0x000000000040b5f4 in main (argc=15, argv=0x7fff0766ac88) at test.c:846
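So the code seems to be stuck in runner_doghost. In all three samples thread 2 is inside runner_doghost for the same cell (c=0x136a280), calling runner_dopair_subset_density or runner_doself_subset_density with count=1, while thread 1 waits in engine_launch. Checking the code, it seems that a while loop never exits because the test on the redo variable never stops being true.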
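To illustrate the suspected failure mode, here is a minimal, self-contained sketch of a redo-style loop with an iteration cap. It is not the code in runner.c: the names toy_part, toy_density, toy_ghost and TOY_MAX_ITER, and the one-dimensional "neighbour count" model, are all made up for illustration. The only point is that a loop of this shape cannot terminate for a particle whose count never reaches the target, and that a cap turns the hang into something that can be reported.

```c
#include <stdio.h>

/* Hypothetical cap on redo passes; not a value taken from the real code. */
#define TOY_MAX_ITER 30

/* Hypothetical stand-in for the per-particle state updated in the ghost step. */
struct toy_part {
  float h;      /* smoothing length */
  float wcount; /* neighbour-count estimate from the last density pass */
};

/* Toy density pass: the count simply scales linearly with h. */
static void toy_density(struct toy_part *p, float density) {
  p->wcount = density * p->h;
}

/* Repeat the density pass until the count is within tol of target.
 * Returns the number of passes used, or -1 if the cap was reached. */
static int toy_ghost(struct toy_part *p, float density, float target, float tol) {
  for (int iter = 0; iter < TOY_MAX_ITER; iter++) {
    toy_density(p, density);

    /* Same role as a redo flag: set while the particle has not converged. */
    const float diff = p->wcount - target;
    const int redo = (diff > tol) || (diff < -tol);
    if (!redo) return iter + 1;

    /* Correct h; with no neighbours at all, just double it. */
    if (p->wcount > 0.0f)
      p->h *= target / p->wcount;
    else
      p->h *= 2.0f;
  }

  /* Without the cap, this case would spin forever with redo still set. */
  fprintf(stderr, "toy_ghost: no convergence, h=%g wcount=%g\n", p->h, p->wcount);
  return -1;
}
```

In this toy model, a particle with h=1, density=4 and target=32 converges on the second pass, while a particle with density=0 never converges and returns -1 once the cap is hit. If the real loop has the same shape, logging the particle and cell when such a cap is reached would show which particle keeps redo set.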