The DiRAC Peta4-Skylake system (CSD3) in Cambridge consists of two clusters. The older cluster (`skylake`) has nodes with Intel Skylake CPUs connected by an Omni-Path interconnect; SWIFT cannot run over this interconnect. The newer cluster (`cclake`) has nodes with Intel Cascade Lake CPUs connected by HDR Infiniband. By default, this system uses Intel MPI 2020. The following modules have been verified to work:

```
module load intel/bundles/complib/2020.2 \
            hdf5-1.10.1-intel-17.0.4-aq7fy5w \
            parmetis-4.0.3-intel-17.0.4-cfs7h6l \
            fftw-3.3.6-pl2-intel-17.0.4-gtkahx5 \
            metis-5.1.0-intel-17.0.4-r6z4bz6
```

The first module is also automatically loaded by `rhel7/default-ccl`, which is the default system module for the `cclake` nodes.
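
Since `rhel7/default-ccl` already provides the Intel bundle, the following sketch should be an equivalent way to set up the environment (an assumption based on the statement above, not separately verified):

```
# Start from the cclake default environment (pulls in
# intel/bundles/complib/2020.2), then add the remaining Spack modules.
module load rhel7/default-ccl \
            hdf5-1.10.1-intel-17.0.4-aq7fy5w \
            parmetis-4.0.3-intel-17.0.4-cfs7h6l \
            fftw-3.3.6-pl2-intel-17.0.4-gtkahx5 \
            metis-5.1.0-intel-17.0.4-r6z4bz6
```
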
The module system on CSD3 is a dysfunctional mixture of a traditional module system and a more modern Spack-based one. On top of that, some of the SWIFT dependencies have candidate copies in system directories that are visible outside the module system. You therefore need to pass the full path of every dependency to the `configure` command to make sure the right libraries are detected:

```
./configure <YOUR CONFIGURATION OPTIONS> \
  --with-parmetis=/usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/intel-17.0.4/parmetis-4.0.3-cfs7h6lgzsct5n3ydvczqa56bit4hbfl \
  --with-metis=/usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/intel-17.0.4/metis-5.1.0-r6z4bz6frgdd7flrrmoyxccliij5fwm7 \
  --with-hdf5=/usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/intel-17.0.4/hdf5-1.10.1-aq7fy5w6euvtgmupdypmq7lo4eh3rilu/bin/h5pcc \
  --with-fftw=/usr/local/software/spack/spack-0.11.2/opt/spack/linux-rhel7-x86_64/intel-17.0.4/fftw-3.3.6-pl2-gtkahx5xzfnrs5cieewd34btltupx2ld \
  CC=mpiicc MPICC=mpiicc
```

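
Given the stray system copies mentioned above, it is worth checking after compilation that the binary really linked against the Spack libraries rather than the system ones. A quick sanity check for a dynamically linked build:

```
# Print the shared libraries resolved by the MPI binary and check that
# the hdf5, fftw and (par)metis entries point into the Spack tree.
ldd swift_mpi | grep -E 'hdf5|fftw|metis'
```
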
Finally, there appears to be an issue with the UCX configuration of the interconnect that leads to strange runtime errors in the MPI library. To work around this, you need to manually instruct UCX to use Infiniband rather than RoCE (or an odd combination of both):

```
export UCX_NET_DEVICES=mlx5_0:1
```

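
To see which network devices UCX detects on a compute node (and hence what `mlx5_0:1` refers to), the `ucx_info` utility that ships with UCX can be used, assuming it is available on the `PATH`:

```
# List the devices and transports UCX has detected; mlx5_0:1 should
# show up among the Infiniband devices.
ucx_info -d | grep -i device
```
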
An example submission script then looks like this:

```
#!/bin/bash -l
#SBATCH -J RUN_NAME
#SBATCH --ntasks=20         # 2 MPI ranks per node, so twice the number of nodes we want
#SBATCH --cpus-per-task=28  # half the 56 cores on a node, i.e. one socket per rank
#SBATCH -o job.%J.out
#SBATCH -e job.%J.err
#SBATCH -p cclake           # default Cascade Lake queue
#SBATCH -t 01:00:00         # time limit: adjust!
#SBATCH -A dirac-dp004-cpu  # project name required for DP004 runs
#SBATCH --exclusive

module purge
module load slurm
module load intel/bundles/complib/2020.2 \
            hdf5-1.10.1-intel-17.0.4-aq7fy5w \
            parmetis-4.0.3-intel-17.0.4-cfs7h6l \
            fftw-3.3.6-pl2-intel-17.0.4-gtkahx5 \
            metis-5.1.0-intel-17.0.4-r6z4bz6

export UCX_NET_DEVICES=mlx5_0:1

cd $SLURM_SUBMIT_DIR

mpirun -n 20 ./swift_mpi <YOUR FLAGS> <YOUR PARAMETERFILE>.yml
```

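
Assuming the script is saved as `submit.sh` (a name chosen here for illustration), it is submitted and monitored with the standard Slurm commands:

```
sbatch submit.sh   # queue the job
squeue -u $USER    # check whether it is pending or running
scancel <JOBID>    # cancel it if needed
```
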