Commit e1624da8 authored by Matthieu Schaller

Conclusion & Spell-checking

parent 3ec85a46
...@@ -374,3 +374,19 @@ archivePrefix = "arXiv",
publisher={IEEE}
}
@article{ref:Gonnet2015,
author = {Pedro Gonnet},
title = {Efficient and Scalable Algorithms for Smoothed Particle Hydrodynamics
on Hybrid Shared/Distributed-Memory Architectures},
journal = {{SIAM} J. Scientific Computing},
volume = {37},
number = {1},
year = {2015},
url = {http://dx.doi.org/10.1137/140964266},
doi = {10.1137/140964266},
timestamp = {Thu, 12 Mar 2015 10:30:34 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/journals/siamsc/Gonnet15},
bibsource = {dblp computer science bibliography, http://dblp.org}
}
...@@ -22,6 +22,7 @@
% Some acronyms
\newcommand{\gadget}{{\sc Gadget-2}\xspace}
\newcommand{\swift}{{\sc swift}\xspace}
\newcommand{\qs}{{\sc QuickShed}\xspace}
...@@ -412,11 +413,11 @@ and in which communication latencies are negligible.
\caption{Task hierarchy for the SPH computations in \swift,
including communication tasks. Arrows indicate dependencies,
i.e.~an arrow from task $A$ to task $B$ indicates that $A$
depends on $B$. The task colour corresponds to the cell or
cells it operates on, e.g.~the density and force tasks work
on individual cells or pairs of cells.
The blue cell data is on a separate rank from the yellow and
purple cells, and thus its data must be sent across during
the computation using {\tt send}/{\tt recv} tasks (diamond-shaped).}
\label{tasks}
\end{figure}
...@@ -427,7 +428,7 @@ and in which communication latencies are negligible.
Although each particle cell resides on a specific rank, the particle
data still needs to be sent to neighbouring ranks which contain
tasks that operate thereon.
This communication must happen twice at each time-step: once to send
the particle positions for the density computation, and again
once the densities have been aggregated locally for the force
computation.
...@@ -435,18 +436,18 @@ computation.
Most distributed-memory codes based on MPI \cite{ref:Snir1998}
separate computation and communication into distinct steps, i.e.~all
the ranks first exchange data, and only once the data exchange is
complete, computation starts. Further data exchanges only happen
once computation has finished, and so on (this pattern is sketched
below).
This approach, although conceptually simple and easy to implement,
has three major drawbacks:
\begin{itemize}
\item The frequent synchronisation points between communication
and computation exacerbate load imbalances,
\item The communication phase consists mainly of waiting on
latencies, during which the node's CPUs usually run idle, and
\item During the computation phase, the communication network
is left completely unused, whereas during the communication
phase, all ranks attempt to use it at the same time.
\end{itemize}
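To make the contrast with our approach concrete, the following is a minimal
sketch of this bulk-synchronous pattern. The function names
({\tt exchange\_halos}, {\tt compute\_density}, etc.) are generic placeholders
and do not refer to \swift or any other particular code:
\begin{verbatim}
/* Bulk-synchronous time-stepping: communication and computation
   strictly alternate. All function names are placeholders. */
#include <mpi.h>

void exchange_halos(void);      /* blocking halo exchange       */
void exchange_densities(void);  /* second exchange of each step */
void compute_density(void);     /* purely local work            */
void compute_forces(void);

void run_bulk_synchronous(int nr_steps) {
  for (int step = 0; step < nr_steps; step++) {
    exchange_halos();            /* all ranks exchange data...          */
    MPI_Barrier(MPI_COMM_WORLD); /* ...and wait at a global sync point  */
    compute_density();           /* network sits idle during this phase */

    exchange_densities();
    MPI_Barrier(MPI_COMM_WORLD);
    compute_forces();
  }
}
\end{verbatim}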
It is for these reasons that in \swift we opted for a fully
...@@ -468,7 +469,7 @@ At the destination, the task is made dependent of the {\tt recv}
task, i.e.~the task can only execute once the data has actually
been received.
This is illustrated in Figure~\ref{tasks}, where data is exchanged across
two ranks for the density and force computations and the extra
dependencies are shown in red.
The communication itself is implemented using the non-blocking
...@@ -476,18 +477,18 @@ The communication itself is implemented using the non-blocking
communication, and {\tt MPI\_Test} to check if the communication
was successful and resolve the communication task's dependencies.
In the task-based scheme, strictly local tasks which do not rely
on communication tasks are executed first.
As the data from other ranks arrive, the corresponding non-local
tasks are unlocked and are executed whenever a thread picks them up.
One direct consequence of this approach is that instead of a single
{\tt send}/{\tt recv} call between each pair of neighbouring ranks,
one such pair is generated for each particle cell.
This type of communication, i.e.~several small messages instead of
one large message, is usually discouraged since the sum of the latencies
of the small messages is much larger than the latency of
the single large message.
This, however, is not a concern since nobody is actually waiting
to receive the messages in order and the latencies are covered
by local computations.
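As an illustration, a minimal sketch of how such communication tasks could be
driven is given below. The data structures and the
{\tt scheduler\_unlock\_dependents} helper are hypothetical placeholders and do
not correspond to \swift's actual internals; only the
{\tt MPI\_Isend}/{\tt MPI\_Test} calls reflect the mechanism described above:
\begin{verbatim}
#include <mpi.h>

/* Hypothetical, minimal stand-ins for the scheduler's data types. */
struct part { double x[3], rho; };
struct cell { struct part *parts; int count; };
struct task { MPI_Request req; /* ...dependency bookkeeping... */ };

void scheduler_unlock_dependents(struct task *t);  /* placeholder */

/* A send task posts its non-blocking send and returns immediately;
   the matching recv task on the other rank posts an MPI_Irecv. */
void activate_send_task(struct task *t, struct cell *c, int dest, int tag) {
  MPI_Isend(c->parts, c->count * (int)sizeof(struct part), MPI_BYTE,
            dest, tag, MPI_COMM_WORLD, &t->req);
}

/* Worker threads poll pending communication tasks in between running
   strictly local tasks; completion resolves the task's dependencies. */
void poll_communication_task(struct task *t) {
  int done = 0;
  MPI_Test(&t->req, &done, MPI_STATUS_IGNORE);
  if (done) scheduler_unlock_dependents(t);
}
\end{verbatim}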
A nice side-effect of this approach is that communication no longer
...@@ -602,7 +603,7 @@ Infiniband FDR10 non-blocking Tree. Islands are then connected using a 4:1
Pruned Tree.
This system is similar in nature to the cosma-5 system used in the previous set
of tests but is much larger, allowing us to demonstrate the scalability of our
framework on the largest systems.
The code was compiled with the Intel compiler version \textsc{2015.5.223} and
...@@ -664,7 +665,7 @@ test are shown on Fig.~\ref{fig:JUQUEEN2}.
machine (see text for hardware description) with a varying number of
threads per node. The number of MPI ranks per node is kept fixed at
one. The code displays excellent scaling even when all the cores and
hardware multi-threads are in use. \label{fig:JUQUEEN1}}
\end{figure}
...@@ -692,13 +693,53 @@ test are shown on Fig.~\ref{fig:JUQUEEN2}.
\section{Discussion \& Conclusion}
The strong scaling results presented in the previous sections on three
different machines demonstrate the ability of our framework to scale both on
small commodity machines, thanks to the use of task-based parallelism at the
node level, and on the largest machines (Tier-0 systems) currently available,
thanks to the asynchronous communications. We stress that these results have
been obtained for a realistic test case without any micro-level optimisation
or explicit vectorisation.
Excellent strong scaling is also achieved when increasing the number of threads
per node (i.e.~per MPI rank, see Fig.~\ref{fig:JUQUEEN1}), demonstrating that
the description of MPI (asynchronous) communications as tasks within our
framework is not a bottleneck. A common notion in HPC is that the number of MPI
communications between nodes should be kept to a minimum to optimise the
efficiency of the calculation. Our approach does exactly the opposite, with a
large number of point-to-point communications between pairs of nodes occurring
over the course of one time-step. For instance, on the SuperMUC machine with 32
nodes (512 cores), each MPI rank contains approximately $1.6\times10^7$
particles in $2.5\times10^5$ cells. \swift will generate around $58,000$
point-to-point asynchronous MPI communications (a pair of \texttt{Isend} and
\texttt{Irecv}) per node and per time-step, a number discouraged by many
practitioners. Dispatching communications over the course of the calculation,
rather than in short bursts as is commonly done, may also help lower the load
on the network and mitigate the loss of efficiency due to the finite bandwidth
of the Infiniband network.
One time-step on $8,192$ nodes of the JUQUEEN machine takes $63~\rm{ms}$ of
wall-clock time. All the loading of the tasks, the communications and the
execution of the tasks take place within that short amount of time. Our
framework can hence load-balance a calculation over $2.6\times10^5$ threads
with very good efficiency.
We stress, as was previously demonstrated by \cite{ref:Gonnet2015}, that \swift
is also much faster than the \gadget code \cite{Springel2005}, the
\emph{de facto} standard in the field of particle-based cosmological
simulations. For instance, the simulation setup that was run on the cosma-5
system takes $2.9~\rm{s}$ of wall-clock time per time-step on $256$ cores using
\swift, whilst the default \gadget code on exactly the same setup with the same
number of cores requires $32~\rm{s}$. Our code hence displays a factor $>10$
performance increase over \gadget. The excellent scaling performance of \swift
allows us to push this number even further by simply increasing the number of
cores, whilst \gadget reaches its peak speed (for this problem) at around 300
cores and stops scaling beyond that. This unprecedented scaling ability,
combined with future work on the vectorisation of the calculations within each
task, will hopefully make \swift an important tool for future simulations in
cosmology and help push the entire field to a new level.
\swift, its documentation and the test cases presented in this paper are all
available at the address \web.
%#####################################################################################################
...