Intel MPI 2017 bug when running 2 MPI ranks per node
When running SWIFT on 4 nodes with 8 MPI ranks (i.e. 2 ranks per node) using Intel MPI 2017, I get the following error:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1865)......: fail failed
MPIR_Comm_commit(711): fail failed
(unknown)(): Other MPI error
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1865)......: fail failed
MPIR_Comm_commit(711): fail failed
(unknown)(): Other MPI error
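For reference, the job is launched roughly like this (a sketch, not the exact command from the attached script; the executable name and parameter file are placeholders):

export I_MPI_DEBUG=50                       # verbose MPI startup output (log below)
mpirun -np 8 -ppn 2 ./swift_mpi parameters.yml   # 8 ranks total; -ppn 2 is Intel MPI's processes-per-node option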
Verbose output with I_MPI_DEBUG=50:
[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2 Build 20170125 (id: 16752)
[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] my_dlopen(): trying to dlopen: libdat2.so.2
[4] my_dlopen(): trying to dlopen: libdat2.so.2
[1] my_dlopen(): trying to dlopen: libdat2.so.2
[3] my_dlopen(): trying to dlopen: libdat2.so.2
[2] my_dlopen(): trying to dlopen: libdat2.so.2
[4] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[4] my_dlopen(): trying to dlopen: libdat2.so.2
[5] my_dlopen(): trying to dlopen: libdat2.so.2
[7] my_dlopen(): trying to dlopen: libdat2.so.2
[6] my_dlopen(): trying to dlopen: libdat2.so.2
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] my_dlopen(): trying to dlopen: libdat2.so.2
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[5] my_dlopen(): trying to dlopen: libdat2.so.2
[3] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[3] my_dlopen(): trying to dlopen: libdat2.so.2
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[2] my_dlopen(): trying to dlopen: libdat2.so.2
[1] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[1] my_dlopen(): trying to dlopen: libdat2.so.2
[7] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[7] my_dlopen(): trying to dlopen: libdat2.so.2
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] my_dlopen(): trying to dlopen: libdat2.so.2
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[1] my_dlopen(): trying to dlopen: libdat2.so.2
[5] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[5] my_dlopen(): trying to dlopen: libdat2.so.2
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[3] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[3] my_dlopen(): trying to dlopen: libdat2.so.2
[7] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[7] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[7] my_dlopen(): trying to dlopen: libdat2.so.2
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] my_dlopen(): trying to dlopen: libdat2.so.2
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[2] my_dlopen(): trying to dlopen: libdat2.so.2
[4] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[4] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[4] my_dlopen(): trying to dlopen: libdat2.so.2
[6] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] my_dlopen(): trying to dlopen: libdat2.so.2
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[5] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[5] MPI startup(): shm and dapl data transfer modes
[7] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[7] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[4] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[4] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): shm and dapl data transfer modes
[6] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[6] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): static connections storm algo
[3] I_MPI_Init_shm_colls_space(): Cannot create shm object: /shm-col-space-5500-2-55C70B39E1EEC errno=Permission denied
[3] I_MPI_Init_shm_colls_space(): Something goes wrong in shared memory initialization (Permission denied)
[2] I_MPI_Init_shm_colls_space(): Cannot create shm object: /shm-col-space-26055-2-55C70B39E1F58 errno=Permission denied
[2] I_MPI_Init_shm_colls_space(): Something goes wrong in shared memory initialization (Permission denied)
I have attached my submission script.
I only noticed this because John hit the same problem with one of his applications. I have tested the same setup with Intel MPI 2016, and it runs fine.
The issue appears to be related to shared-memory initialization: ranks 2 and 3 fail to create the /shm-col-space-* objects with "Permission denied" (see the end of the debug output above).
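A quick thing to check on the compute nodes (a diagnostic sketch; it assumes the shm-col-space objects are POSIX shared-memory objects backed by /dev/shm, which is where such objects normally live on Linux):

# Leftover shared-memory objects from earlier (crashed) runs, possibly owned by another user
ls -l /dev/shm/shm-col-space-* 2>/dev/null

# /dev/shm itself should be world-writable with the sticky bit set (mode 1777)
ls -ld /dev/shm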