MPI Bcast for params
I'm not sure how this has arisen, but I did recently pull and recompile SWIFT (from main); the setup worked before with no issues. I'm having issues with MPI hangs, above a certain number of ranks, when it broadcasts the parameter file in swift.c:
#ifdef WITH_MPI
/* Broadcast the parameter file */
MPI_Bcast(params, sizeof(struct swift_params), MPI_BYTE, 0, MPI_COMM_WORLD);
#endif
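One thing I may try as a workaround is splitting the broadcast into smaller pieces, in case the single large message is what trips the transport. A minimal sketch (the chunk size and helper names are my own, not SWIFT's):

```c
#include <assert.h>
#include <stddef.h>
#ifdef WITH_MPI
#include <mpi.h>
#endif

/* Hypothetical chunk size; not a SWIFT constant. */
#define BCAST_CHUNK_BYTES (1 << 20)

/* Number of chunks needed to cover `total` bytes. */
static size_t chunk_count(size_t total, size_t chunk) {
  return (total + chunk - 1) / chunk;
}

#ifdef WITH_MPI
/* Broadcast `size` bytes from rank 0 in fixed-size pieces
 * instead of one large MPI_Bcast. */
static void bcast_in_chunks(void *buf, size_t size, MPI_Comm comm) {
  char *p = (char *)buf;
  for (size_t off = 0; off < size; off += BCAST_CHUNK_BYTES) {
    size_t n =
        size - off < BCAST_CHUNK_BYTES ? size - off : BCAST_CHUNK_BYTES;
    MPI_Bcast(p + off, (int)n, MPI_BYTE, 0, comm);
  }
}
#endif
```

which would be called as `bcast_in_chunks(params, sizeof(struct swift_params), MPI_COMM_WORLD)` in place of the single call above. I don't know yet whether message size is actually the trigger.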
This is on c7-rp. With 10 ranks (14 tasks per rank, so 5 nodes) it works; beyond that it hangs.
I was originally using Intel MPI 2018 and tried a newer MPI with the same result, although Intel MPI 2021 does at least give an error:
[0000] [00000.0] main: Reading runtime parameters from file 'params.yml'
[m7197:209413:0:209413] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[m7197:209413:0:209413] ib_mlx5_log.c:139 RC QP 0x5c96 wqe[2]: SEND --e [va 0x2afcde9f5400 len 8256 lkey 0x5ff11]
[m7197:209414:0:209414] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[m7197:209414:0:209414] ib_mlx5_log.c:139 RC QP 0x5c91 wqe[0]: SEND --e [va 0x2aed6eddee80 len 8256 lkey 0x61336]
/cosma/local/software/ucx/ucx-1.8.1/src/uct/ib/mlx5/ib_mlx5_log.c: [ uct_ib_mlx5_completion_with_err() ]
...
129 }
130
131 ucs_log(log_level,
==> 132 "%s on "UCT_IB_IFACE_FMT"/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n"
133 "%s QP 0x%x wqe[%d]: %s",
134 err_info, UCT_IB_IFACE_ARG(iface),
135 uct_ib_iface_is_roce(iface) ? "RoCE" : "IB",
==== backtrace (tid: 209414) ====
0 0x00000000000206f9 uct_ib_mlx5_completion_with_err() /cosma/local/software/ucx/ucx-1.8.1/src/uct/ib/mlx5/ib_mlx5_log.c:132
1 0x0000000000045da1 uct_rc_mlx5_iface_handle_failure() /cosma/local/software/ucx/ucx-1.8.1/src/uct/ib/rc/accel/rc_mlx5_iface.c:217
2 0x0000000000040bdd uct_ib_mlx5_poll_cq() /cosma/local/software/ucx/ucx-1.8.1/src/uct/ib/mlx5/ib_mlx5.inl:81
3 0x000000000002a715 ucs_callbackq_dispatch() /cosma/local/software/ucx/ucx-1.8.1/src/ucs/datastruct/callbackq.h:211
4 0x000000000002a715 uct_worker_progress() /cosma/local/software/ucx/ucx-1.8.1/src/uct/api/uct.h:2221
5 0x000000000002a715 ucp_worker_progress() /cosma/local/software/ucx/ucx-1.8.1/src/ucp/core/ucp_worker.c:1951
6 0x000000000000a191 mlx_ep_progress() mlx_ep.c:0
7 0x0000000000020bcd ofi_cq_progress() osd.c:0
8 0x000000000002199b ofi_cq_readfrom() osd.c:0
9 0x0000000000663166 fi_cq_read() /usr/include/rdma/fi_eq.h:385
10 0x00000000001a8f4b MPIDI_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:181
11 0x00000000001a8f4b MPID_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:236
12 0x00000000001a8f4b MPID_Progress_wait() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:297
13 0x000000000080344b MPIR_Wait_impl() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/request/wait.c:40
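The "Transport retry count exceeded" error comes from the UCX/RoCE layer underneath Intel MPI's OFI fabric, so before digging further into SWIFT itself I'll try ruling out the fabric path. A sketch of the environment knobs I'd experiment with (values illustrative, not a known fix):

```shell
# Make Intel MPI's fabric choice explicit: shared memory within
# a node, OFI (libfabric) between nodes.
export I_MPI_FABRICS=shm:ofi

# Verbose MPI startup logging, to see which provider/transport
# the broadcast actually goes through.
export I_MPI_DEBUG=5

# Surface libfabric warnings alongside the UCX ones.
export FI_LOG_LEVEL=warn
```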