\documentclass{sig-alternate-05-2015}
\usepackage{times,amsmath,amsfonts,amssymb,epstopdf,xspace}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[usenames]{color}
\newcommand{\red}[1]{{\textcolor{red}{#1}}}
\pdfminorversion=7

%Journals
\newcommand{\mnras}{MNRAS}
\newcommand{\jcap}{JCAP}
\newcommand{\physrep}{Phys.~Rep.} % Physics Reports
\newcommand{\apjs}{ApJS}

% Latex tricks
\newcommand{\oh}[1]{\mbox{$ {\mathcal O}( #1 ) $}}
\newcommand{\eqn}[1] {(\ref{eqn:#1})}
\makeatletter
\newcommand{\pushright}[1]{\ifmeasuring@#1\else\omit\hfill$\displaystyle#1$\fi\ignorespaces}
\makeatother

% Some acronyms
\newcommand{\gadget}{{\sc Gadget-2}\xspace}
\newcommand{\swift}{{\sc swift}\xspace}
\newcommand{\qs}{{\sc QuickSched}\xspace}

% Webpage
\newcommand{\web}{\url{www.swiftsim.com}}

%#####################################################################################################

\begin{document}

\CopyrightYear{2016}
\setcopyright{acmlicensed}
\conferenceinfo{PASC '16,}{June 08 - 10, 2016, Lausanne, Switzerland}
\isbn{978-1-4503-4126-4/16/06}\acmPrice{\$15.00}
\doi{http://dx.doi.org/10.1145/2929908.2929916}

\title{{\ttlit SWIFT}: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100\,000 cores.}

\numberofauthors{4}
\author{
\alignauthor Matthieu~Schaller\\
\affaddr{Institute for Computational Cosmology (ICC)}\\
\affaddr{Department of Physics}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\email{\footnotesize \url{schaller@strw.leidenuniv.nl}}
\alignauthor Pedro~Gonnet\\
\affaddr{School of Engineering and Computing Sciences}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\vspace{1ex}
\affaddr{Google Switzerland GmbH}\\
\affaddr{8002 Z\"urich, Switzerland}\\
\and
\alignauthor Aidan~B.~G.~Chalk\\
\affaddr{School of Engineering and Computing Sciences}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\alignauthor Peter~W.~Draper\\
\affaddr{Institute for Computational Cosmology (ICC)}\\
\affaddr{Department of Physics}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
}
\date{\today}

\maketitle

%#####################################################################################################

\begin{abstract}
We present a new open-source cosmological code, called \swift, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared / distributed-memory architectures. \swift was designed from the bottom up to provide excellent {\em strong scaling} on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches:
\begin{itemize}
\item \textbf{Task-based parallelism} for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.
\item \textbf{Graph-based domain decomposition}, which uses the task graph to decompose the simulation domain such that the {\em work}, as opposed to just the {\em data} (as is the case with most partitioning schemes), is equally distributed across all nodes.
\item \textbf{Fully dynamic and asynchronous communication}, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until that data arrives.
\end{itemize}
In order to use these approaches, the code had to be re-written from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60\% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Blue~Gene-based architectures.
\end{abstract}

\keywords{Physics; Cosmology; Fluid dynamics; Smoothed Particle Hydrodynamics; Task-based parallelism; Asynchronous data transfer; Extreme scaling}

%#####################################################################################################

\section{Introduction}

For the past decade, physical limitations have kept the speed of individual processor cores constrained, so instead of getting {\em faster}, computers are getting {\em more parallel}. Systems containing up to 64 general-purpose cores are becoming commonplace, and the number of cores can be expected to continue growing exponentially, e.g.~following Moore's law, much in the same way processor speeds did up until a few years ago. As a consequence, in order to get faster, computer programs need to rely on better exploitation of this massive parallelism, i.e.~they need to exhibit {\em strong scaling}, the ability to run roughly $N$ times faster when executed on $N$ times as many processors. Without strong scaling, massive parallelism can still be used to tackle ever {\em larger} problems, but not to make fixed-size problems {\em faster}.

Although this switch from growth in speed to growth in parallelism has been anticipated and observed for quite some time, very little has changed in terms of how we design and implement parallel computations. Branch-and-bound synchronous parallelism using OpenMP~\cite{ref:Dagum1998} and MPI~\cite{ref:Snir1998}, as well as domain decompositions based on geometry or space-filling curves \cite{warren1993parallel}, are still commonplace, despite both the architectures and problem scales having changed dramatically since their introduction.

The design and implementation of \swift\footnote{\swift is an open-source software project and the latest version of the source code, along with all the data needed to run the test cases presented in this paper, can be downloaded at \web.} \cite{gonnet2013swift,theuns2015swift,ref:Gonnet2015}, a large-scale cosmological simulation code built from scratch, provided the perfect opportunity to test some newer approaches, i.e.~task-based parallelism, fully asynchronous communication, and graph partition-based domain decompositions. This paper describes these techniques, which are not exclusive to cosmological simulations or any specific architecture, as well as the results obtained with them. While \cite{gonnet2013swift,ref:Gonnet2015} already describe the underlying algorithms in more detail, in this paper we focus on the parallelization strategy and the results obtained on larger Tier-0 systems.

%#####################################################################################################

\section{Particle-based hydrodynamics}

Smoothed Particle Hydrodynamics \cite{Gingold1977,Price2012} (SPH) uses particles to represent fluids. Each particle $p_i$ has a position $\mathbf x_i$, velocity $\mathbf v_i$, internal energy $u_i$, mass $m_i$, and a smoothing length $h_i$.
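For concreteness, the per-particle state listed above can be gathered into a small structure. The following C snippet is purely illustrative and does not reflect \swift's actual data layout:
{\footnotesize
\begin{verbatim}
/* Illustrative particle layout; SWIFT's actual
   struct differs. */
struct part {
  double x[3]; /* position x_i                  */
  float v[3];  /* velocity v_i                  */
  float u;     /* internal energy u_i           */
  float m;     /* mass m_i                      */
  float h;     /* smoothing length h_i          */
  float rho;   /* density, filled in by the
                  density computation below     */
};
\end{verbatim}
}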
The particles are used to interpolate any quantity $Q$ at any point in space as a weighted sum over the particles:
%
\begin{equation}
  Q(\mathbf r) = \sum_i m_i \frac{Q_i}{\rho_i} W( \|\mathbf r - \mathbf r_i\| , h )
  \label{eqn:interp}
\end{equation}
%
where $Q_i$ is the quantity at the $i$th particle, $h$ is the {\em smoothing length}, i.e.~the radius of the sphere within which data will be considered for the interpolation, and $W(r,h)$ is the {\em smoothing kernel} or {\em smoothing function}. Several different forms for $W(r,h)$ exist, each with their own specific benefits and drawbacks.

The particle density $\rho_i$ used in \eqn{interp} is itself computed similarly:
%
\begin{equation}
  \rho_i = \sum_{j,~r_{ij} < h_i} m_j W(r_{ij},h_i)
  \label{eqn:rho}
\end{equation}
%
where $r_{ij} = \|\mathbf{r_i}-\mathbf{r_j}\|$ is the Euclidean distance between particles $p_i$ and $p_j$. In compressible simulations, the smoothing length $h_i$ of each particle is chosen such that the number of neighbours with which it interacts is kept more or less constant, which can result in smoothing lengths spanning several orders of magnitude within the same simulation.

Once the densities $\rho_i$ have been computed, the time derivatives of the velocity and internal energy, which require $\rho_i$, are computed as follows:
%
\begin{align}
  \frac{d\mathbf{v}_i}{dt} & = -\sum_{j,~r_{ij} < \hat{h}_{ij}} m_j \left[ \frac{P_i}{\Omega_i\rho_i^2}\nabla_rW(r_{ij},h_i)\right. + \label{eqn:dvdt}\\
  & \pushright{\left.\frac{P_j}{\Omega_j\rho_j^2}\nabla_rW(r_{ij},h_j) \right], \nonumber} \\
  \frac{du_i}{dt} & = \frac{P_i}{\Omega_i\rho_i^2} \sum_{j,~r_{ij} < h_i} m_j(\mathbf v_i - \mathbf v_j) \cdot \nabla_rW(r_{ij},h_i), \label{eqn:dudt}
\end{align}
%
where $\hat{h}_{ij} = \max\{h_i,h_j\}$, and the particle pressure $P_i=\rho_i u_i (\gamma-1)$ and correction term $\Omega_i=1 + \frac{h_i}{3\rho_i}\frac{\partial \rho}{\partial h}$ are computed on the fly.

The computations in \eqn{rho}, \eqn{dvdt}, and \eqn{dudt} involve finding all pairs of particles within range of each other. Any particle $p_j$ is {\em within range} of a particle $p_i$ if the distance between $p_i$ and $p_j$ is smaller than or equal to the smoothing length $h_i$ of $p_i$, as is the case in \eqn{rho}. Note that since smoothing lengths may vary between particles, this association is not symmetric, i.e.~$p_j$ may be in range of $p_i$, but $p_i$ not in range of $p_j$. If $r_{ij} < \max\{h_i,h_j\}$, as is required in \eqn{dvdt}, then particles $p_i$ and $p_j$ are within range {\em of each other}. The computation thus proceeds in two distinct stages that are evaluated separately:
\begin{enumerate}
\item {\em Density} computation: For each particle $p_i$, loop over all particles $p_j$ within range of $p_i$ and evaluate \eqn{rho}.
\item {\em Force} computation: For each particle $p_i$, loop over all particles $p_j$ such that $p_i$ and $p_j$ are within range of each other and evaluate \eqn{dvdt} and \eqn{dudt}.
\end{enumerate}

Finding the interacting neighbours for each particle constitutes the bulk of the computation. Many codes, e.g.~in Astrophysics simulations \cite{Gingold1977}, rely on spatial {\em trees} for neighbour finding \cite{Gingold1977,Hernquist1989,Springel2005,Wadsley2004}, i.e.~$k$-d trees \cite{Bentley1975} or octrees \cite{Meagher1982} are used to decompose the simulation space.
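Schematically, the density stage of the two-step computation above amounts to the following direct pair loop, shown here as a minimal C sketch that assumes the illustrative {\tt struct part} above and some kernel function {\tt W}. Neither \swift nor the tree-based codes mentioned here ever evaluate this \oh{N^2} form; it is precisely the computation that the neighbour-finding strategies discussed in this and the next section are designed to accelerate.
{\footnotesize
\begin{verbatim}
#include <math.h>

/* Direct (quadratic) evaluation of the density sum
   rho_i = sum_j m_j W(r_ij, h_i); illustration only. */
void density_direct(struct part *p, int N,
                    float (*W)(float r, float h)) {
  for (int i = 0; i < N; i++) {
    p[i].rho = 0.f;
    for (int j = 0; j < N; j++) {
      const double dx = p[i].x[0] - p[j].x[0];
      const double dy = p[i].x[1] - p[j].x[1];
      const double dz = p[i].x[2] - p[j].x[2];
      const float r =
          (float)sqrt(dx*dx + dy*dy + dz*dz);
      if (r < p[i].h)              /* r_ij < h_i */
        p[i].rho += p[j].m * W(r, p[i].h);
    }
  }
}
\end{verbatim}
}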
In Astrophysics in particular, spatial trees are also a somewhat natural choice as they are used to compute long-range gravitational interactions via a Barnes-Hut \cite{Barnes1986} or Fast Multipole \cite{Carrier1988} method. The particle interactions are then computed by traversing the list of particles and searching for their neighbours in the tree. In current state-of-the-art simulations (e.g.~\cite{Schaye2015}), the gravity calculation corresponds to roughly one quarter of the calculation time whilst the hydrodynamics scheme takes approximately half of the total time. The remainder is spent in the astrophysics modules, which contain interactions of the same kind as the SPH sums presented here. Gravity could, however, be vastly sped up compared to commonly used software by employing more modern algorithms such as the Fast Multipole Method \cite{Carrier1988}. Although such tree traversals are trivial to parallelize, they have several disadvantages, e.g.~with regard to computational efficiency, cache efficiency, and exploiting symmetries in the computation (see \cite{ref:Gonnet2015} for a more detailed analysis).

%#####################################################################################################

\section{Parallelization strategy}

One of the main concerns when developing \swift was to break with the branch-and-bound type parallelism inherent to parallel codes using OpenMP and MPI, and the constant synchronization between computational steps it results in. If {\em synchronization} is the main problem, then {\em asynchronicity} is the obvious solution. We therefore opted for a {\em task-based} approach for maximum single-node, or shared-memory, performance. This approach not only provides excellent load-balancing on a single node, but also provides a powerful model of the computation that can be used to distribute the work equitably over a set of distributed-memory nodes using general-purpose graph partitioning algorithms. Finally, the necessary communication between nodes can itself be modelled in a task-based way, interleaving communication seamlessly with the rest of the computation.

\subsection{Task-based parallelism}

Task-based parallelism is a shared-memory parallel programming paradigm in which a computation is broken down into a set of {\em tasks} which can be executed concurrently. In order to ensure that the tasks are executed in the right order, e.g.~that data needed by one task is only used once it has been produced by another task, {\em dependencies} between tasks are specified and strictly enforced by a task scheduler. Additionally, if two tasks require exclusive access to the same resource, yet in no particular order, they are treated as {\em conflicts} and the scheduler ensures that they are not executed concurrently. Computations described in this way then parallelize trivially: each processor repeatedly grabs a task for which all dependencies have been satisfied and executes it, until no tasks are left. The main advantages of using a task-based approach are
%
\begin{itemize}
\item The order in which the tasks are processed, and how they are assigned to each processor, are completely dynamic and adapt automatically to load imbalances.
\item If the dependencies and conflicts are specified correctly, there is no need for expensive explicit locking, synchronization, or atomic operations to deal with most concurrency problems.
\item Each task has exclusive access to the data it is working on, thus improving cache locality and efficiency.
\end{itemize}
%
Task-based parallelism is not a particularly new concept and therefore several implementations thereof exist, e.g.~Cilk \cite{ref:Blumofe1995}, QUARK \cite{ref:QUARK}, StarPU \cite{ref:Augonnet2011}, SMP~Superscalar \cite{ref:SMPSuperscalar}, OpenMP~3.0 \cite{ref:Duran2009}, Intel's TBB \cite{ref:Reinders2007}, and, to some extent, Charm++ \cite{ref:Kale1993}. Since none of these existing task-based libraries provided the flexibility required to experiment with different scheduling and communication techniques (\swift is an interdisciplinary effort between Computer Science and Astrophysics to study not only cosmological phenomena, but also novel simulation algorithms and parallel computing techniques), we chose to implement our own task scheduler in \swift, which has since been back-ported as the general-purpose \qs task scheduler \cite{gonnet2013quicksched}. In \qs and \swift, task dependencies are specified explicitly, as opposed to being implicitly derived from data dependencies, allowing us to more easily build complex task hierarchies. This also allowed us to extend the scheduler with the concept of task conflicts and to integrate the asynchronous communication scheme described further on.

Despite its advantages, and the variety of implementations, task-based parallelism is rarely used in practice (notable exceptions include the PLASMA project \cite{ref:Agullo2009} which uses QUARK/StarPU, and the {\tt deal.II} project \cite{ref:Bangerth2007} which uses Intel's TBB). The main problem is that to effectively use task-based parallelism, most computations need to be completely redesigned to fit the paradigm, which is usually not an option for large and complex codebases. Since we were re-implementing \swift from scratch, this was not an issue.

The tree-based neighbour-finding described above was replaced with a more task-friendly approach as described in \cite{ref:Gonnet2015}, in which the domain is first decomposed into a grid of {\em cells} of edge length larger than or equal to the largest particle smoothing length. An initial set of interaction tasks is then defined over all cells and pairs of neighbouring cells, such that if two particles are close enough to interact, they are either in the same cell or they span a pair of neighbouring cells. These initial interaction tasks are then refined by recursively splitting cells that contain more than a certain number of particles and replacing tasks that span a pair of split cells with tasks spanning the neighbouring sub-cells. The resulting refined set of tasks contains all the cells and pairs of cells over which particle interactions must be computed.

The dependencies between the tasks are set following equations \eqn{rho}, \eqn{dvdt}, and \eqn{dudt}, i.e.~such that for any cell, all the tasks computing the particle densities therein must have completed before the particle forces can be computed, and all the force computations must have completed before the particle velocities may be updated. The task hierarchy is shown in Fig.~\ref{tasks}, where the particles in each cell are first sorted (round tasks) before the particle densities are computed (first layer of square tasks). Ghost tasks (triangles) are used to ensure that all density computations on a cell of particles have completed before the force evaluation tasks (second layer of square tasks) execute. Once all the force tasks on a cell of particles have completed, the integrator tasks (inverted triangles) update the particle positions and velocities.
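As a toy illustration of how such dependency-driven execution proceeds, the following self-contained C snippet encodes the sort $\rightarrow$ density $\rightarrow$ ghost $\rightarrow$ force $\rightarrow$ integrator chain of Fig.~\ref{tasks} for a single cell and runs each task as soon as its dependencies have resolved. It is a minimal sketch, not \swift's actual scheduler, which additionally handles conflicts, pair tasks, and many concurrent workers:
{\footnotesize
\begin{verbatim}
#include <stdio.h>

/* Toy dependency-driven scheduler: each task counts
   its unresolved dependencies ("wait") and releases
   its dependants once it has run. */
enum { SORT, DENS, GHOST, FORCE, INTEG, NR_TASKS };
static const char *name[NR_TASKS] =
    { "sort", "density", "ghost", "force", "integrator" };
struct task { int wait, nr_unlocks, unlocks[2]; };

int main(void) {
  struct task t[NR_TASKS] = {
      [SORT]  = { 0, 1, { DENS } },
      [DENS]  = { 1, 1, { GHOST } },
      [GHOST] = { 1, 1, { FORCE } },
      [FORCE] = { 1, 1, { INTEG } },
      [INTEG] = { 1, 0, { 0 } } };
  int done = 0;
  while (done < NR_TASKS)       /* single "worker" */
    for (int i = 0; i < NR_TASKS; i++)
      if (t[i].wait == 0) {     /* task is ready   */
        printf("running %s task\n", name[i]);
        t[i].wait = -1;         /* mark as done    */
        done++;
        for (int k = 0; k < t[i].nr_unlocks; k++)
          t[t[i].unlocks[k]].wait--;
      }
  return 0;
}
\end{verbatim}
}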
The decomposition was computed such that each cell contains $\sim 100$ particles, which leads to tasks of up to a few milliseconds each. Due to the cache-friendly nature of the task-based computations, and their ability to exploit symmetries in the particle interactions, the task-based approach is already more efficient than the tree-based neighbour search on a single core, and scales efficiently to all cores of a shared-memory machine \cite{ref:Gonnet2015}.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/Hierarchy3}
\caption{Task hierarchy for the SPH computations in \swift, including communication tasks. Arrows indicate dependencies, i.e.~an arrow from task $A$ to task $B$ indicates that $A$ depends on $B$. The task color corresponds to the cell or cells it operates on, e.g.~the density and force tasks work on individual cells or pairs of cells. The task hierarchy is shown for three cells: blue, yellow, and purple. Although the data for the yellow cell resides on Node~2, it is required for some tasks on Node~1, and thus needs to be copied over during the computation using {\tt send}/{\tt recv} tasks (diamond-shaped). \newline Figure adapted from \cite{ref:Gonnet2015}. }
\label{tasks}
\end{figure}

\subsection{Task-based domain decomposition}

Given a task-based description of a computation, partitioning it over a fixed number of {\em ranks} (using the MPI terminology) is relatively straightforward: we create a {\em cell hypergraph} in which:
\begin{itemize}
\item each {\em node} represents a single cell of particles, and
\item each {\em edge} represents the tasks connecting the cells.
\end{itemize}
Since in the particular case of \swift each task references at most two cells, the cell hypergraph is just a regular {\em cell graph}. Any partition of the cell graph represents a partition of the computation, i.e.~the graph nodes (cells) in each partition belong to one rank, and the data belonging to each cell resides on the partition/rank to which it has been assigned. Any task spanning cells that belong to the same partition needs only to be evaluated on that rank/partition, whereas tasks spanning more than one partition need to be evaluated on both ranks/partitions. If we then weight each edge with the computational cost associated with its tasks, finding a {\em good} cell distribution reduces to finding a partition of the cell graph such that the maximum, over all partitions, of the summed weight of the edges within and spanning that partition is minimal (see Fig.~\ref{taskgraphcut}). Since the sum of the weights is directly proportional to the amount of computation per rank/partition, minimizing the maximum sum corresponds to minimizing the time spent on the slowest rank. Computing such a partition is a standard graph problem, and several software libraries exist which provide good solutions\footnote{Computing the optimal partition for more than two nodes is considered NP-hard.}, e.g.~METIS \cite{ref:Karypis1998} and Zoltan \cite{devine2002zoltan}.

In \swift, the graph partitioning is computed using the METIS library. The cost of each task is initially approximated via the asymptotic cost of the task type and the number of particles involved. After a task has been executed, its effective computational cost is measured and used instead. Note that this approach does not explicitly consider any geometric constraints, or strive to partition the {\em amount} of data equitably. The only criterion is the computational cost of each partition, for which the task decomposition provides a convenient model.
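In practice, once the cell graph has been assembled in compressed sparse row (CSR) form with the task costs as weights, the partition itself is obtained from a single library call. The sketch below shows what such a call could look like with METIS~5; the function and variable names are illustrative rather than \swift's actual implementation, and METIS balances the vertex weights while minimizing the weight of the cut edges, which acts as a proxy for the objective described above.
{\footnotesize
\begin{verbatim}
#include <metis.h>

/* Illustrative only: partition a weighted cell graph
   (CSR arrays xadj/adjncy) over nranks ranks.  vwgt
   holds the cost of each cell's own tasks, adjwgt
   the cost of the pair tasks on each edge. */
void partition_cells(idx_t ncells, idx_t *xadj,
                     idx_t *adjncy, idx_t *vwgt,
                     idx_t *adjwgt, idx_t nranks,
                     idx_t *part /* rank per cell */) {
  idx_t ncon = 1, objval;
  METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                      vwgt, /* vsize  = */ NULL,
                      adjwgt, &nranks,
                      /* tpwgts  = */ NULL,
                      /* ubvec   = */ NULL,
                      /* options = */ NULL,
                      &objval, part);
}
\end{verbatim}
}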
We are therefore partitioning the {\em computation}, as opposed to just the {\em data}. Note also that the proposed partitioning scheme takes neither the task hierarchy nor the size of the data that needs to be exchanged between partitions/ranks into account. This approach is therefore only reasonable in situations in which the task graph is wider than it is deep, i.e.~the length of the critical path in the task graph is much smaller than the summed cost of all tasks, and in which communication latencies are negligible.

\begin{figure}
\centering
\includegraphics[width=0.8\columnwidth]{Figures/task_graph_cut}
\caption{Illustration of the task-based domain decomposition in which the tasks (circles) are {\em hyperedges} that connect one or more resources (rectangles). The resources are partitioned along the thick dotted line. The blue and orange tasks are executed on the respective partitions, whereas the green tasks/hyperedges along the cut line are executed on both. The cost of this partition is the sum of the green tasks, which are computed twice, as well as the cost imbalance of the tasks executed in each partition.}
\label{taskgraphcut}
\end{figure}

\subsection{Asynchronous communications}

Although each particle cell resides on a specific rank, the particle data will still need to be sent to any neighbouring ranks that have tasks that depend on this data, e.g.~the green tasks in Fig.~\ref{taskgraphcut}. This communication must happen twice at each time-step: once to send the particle positions for the density computation, and then again once the densities have been aggregated locally for the force computation.

Most distributed-memory codes based on MPI \cite{ref:Snir1998} separate computation and communication into distinct steps, i.e.~all the ranks first exchange data, and only when the data exchange is complete does computation start. Further data exchanges only happen once computation has finished, and so on. This approach, although conceptually simple and easy to implement, has three major drawbacks:
\begin{itemize}
\item The frequent synchronization points between communication and computation exacerbate load imbalances,
\item the communication phase consists mainly of waiting on latencies, during which the node's CPUs usually run idle, and
\item during the computation phase, the communication network is left completely unused, whereas during the communication phase, all ranks attempt to use it at the same time.
\end{itemize}

It is for these reasons that in \swift we opted for a fully {\em dynamic and asynchronous} communication model in which local data is sent whenever it is ready, data from other ranks is only acted on once it has arrived, and there is no separation into communication and computational phases. In practice this means that no rank will sit idle waiting on communication if there is any computation that can be done. This fits quite naturally within the task-based framework by modelling communication as just another task type, i.e.~adding tasks that send and receive particle data between ranks. For every task that uses data that resides on a different rank, {\tt send} and {\tt recv} tasks are generated automatically on the source and destination ranks respectively. At the destination, the task is made dependent on the {\tt recv} task, i.e.~the task can only execute once the data has actually been received. This is illustrated in Fig.~\ref{tasks}, where data is exchanged across two ranks for the density and force computations and the extra dependencies are shown in red.
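Concretely, each communication task wraps a pair of non-blocking MPI calls. A minimal sketch of the receiving side is given below, with illustrative helper names rather than \swift's actual interface; the details of the scheme are discussed next.
{\footnotesize
\begin{verbatim}
#include <mpi.h>

/* Illustrative sketch of a recv task: post the
   non-blocking receive when the task is created... */
void recv_task_post(void *buf, int size, int from,
                    int tag, MPI_Request *req) {
  MPI_Irecv(buf, size, MPI_BYTE, from, tag,
            MPI_COMM_WORLD, req);
}

/* ...and let the scheduler poll it.  Once the data
   has arrived, the dependencies of the tasks waiting
   on this cell's particles can be resolved. */
int recv_task_done(MPI_Request *req) {
  int flag;
  MPI_Test(req, &flag, MPI_STATUS_IGNORE);
  return flag;
}
\end{verbatim}
}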
The communication itself is implemented using the non-blocking {\tt MPI\_Isend} and {\tt MPI\_Irecv} primitives to initiate communication, and {\tt MPI\_Test} to check whether the communication has completed and to resolve the communication task's dependencies. In the task-based scheme, strictly local tasks which do not rely on communication tasks are executed first. As data from other ranks arrives, the corresponding non-local tasks are unlocked and are executed whenever a thread picks them up.

One direct consequence of this approach is that instead of a single {\tt send}/{\tt recv} call between each pair of neighbouring ranks, one such pair is generated for each particle cell. This type of communication, i.e.~several small messages instead of one large message, is usually strongly discouraged since the sum of the latencies for the small messages is generally much larger than the latency of the single large message. This, however, is of no concern in \swift since no rank is actively waiting to receive the messages in any particular order, and the communication latencies are covered by local computations. A nice side-effect of this approach is that communication no longer happens in bursts involving all the ranks at the same time, but is more or less evenly spread over the entire computation, and is therefore less demanding of the communication infrastructure.

%#####################################################################################################

\section{Scaling tests}

In this section we present some strong scaling tests of the \swift code on different architectures for a representative cosmology problem.

\subsection{Simulation setup}

In order to provide a realistic setup, the initial particle distributions used in our tests were extracted by resampling low-redshift outputs of the EAGLE project \cite{Schaye2015}, a large suite of state-of-the-art cosmological simulations. By selecting outputs at late times (redshift $z=0.5$), we constructed a simulation setup which is representative of the most expensive part of these simulations, i.e.~when the particles are highly clustered and no longer uniformly distributed. This distribution of particles is shown in Fig.~\ref{fig:ICs}. In order to fit our simulation setup into the limited memory of some of the systems tested, we randomly down-sampled the particle count of the output to $800^3=5.12\times10^8$, $600^3=2.16\times10^8$, and $376^3\approx5.3\times10^7$ particles, keeping periodic boundary conditions in all cases. Scripts to generate these initial conditions are provided with the source code.

We then ran the \swift code for 100 time-steps and averaged the wall-clock time of these time-steps after removing the first and last ones, during which disk I/O occurs. The initial load balancing between nodes is computed in the first steps and re-computed every 100 time-steps, and is therefore not included in the timings. Particles were exchanged between nodes whenever they strayed too far beyond the cells in which they originally resided.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/cosmoVolume}
\caption{The initial density field computed from the initial particle distribution used for our tests. The density $\rho_i$ of the particles spans 8 orders of magnitude, requiring smoothing lengths $h_i$ changing by a factor of almost $1000$ across the simulation volume.
\label{fig:ICs}}
\end{figure}

On all the machines, the code was compiled out of the box, without any tuning, explicit vectorization, or exploitation of any other specific features of the underlying hardware.

\subsection{x86 architecture: COSMA-5}

For our first test, we ran \swift on the COSMA-5 DiRAC2 Data Centric System\footnote{\url{icc.dur.ac.uk/index.php?content=Computing/Cosma}} located at the University of Durham. The system consists of 420 nodes, each with 2 Intel Sandy Bridge-EP Xeon E5-2670\footnote{\url{http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI}} CPUs clocked at $2.6~\rm{GHz}$ and $128~\rm{GByte}$ of RAM. The 16-core nodes are connected using a Mellanox FDR10 Infiniband 2:1 blocking configuration. This system is similar to many Tier-2 systems available in most universities or computing facilities. Demonstrating strong scaling on such a machine is essential to show that the code can be used efficiently even on the type of commodity hardware available to most researchers. The code was compiled with the Intel compiler version \textsc{2016.0.1} and linked to the Intel MPI library version \textsc{5.1.2.150} and METIS library version \textsc{5.1.0}.

The simulation setup with $376^3$ particles was run on that system using 1 to 256 threads on 1 to 16 nodes, the results of which are shown in Fig.~\ref{fig:cosma}. For this strong scaling test, we used one MPI rank per node and 16 threads per node (i.e.~one thread per physical core). We also ran on a single node using up to 32 threads, i.e.~up to one thread per physical and virtual core. Fig.~\ref{fig:domains} shows the domain decomposition obtained via the task-graph decomposition algorithm described above for the 16-node run.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/domains}
\caption{The particles for the initial conditions shown in Fig.~\ref{fig:ICs}, colored according to the node they belong to after a load-balancing call on 32 nodes. As can be seen, the domain decomposition follows the cells in the mesh but is not made of regular cuts. Domains have different shapes and sizes.
\label{fig:domains}}
\end{figure}

\begin{figure*}
\centering
\includegraphics[width=\textwidth]{Figures/scalingCosma}
\caption{Strong scaling test on the COSMA-5 machine (see text for hardware description). \textit{Left panel:} Code speed-up. \textit{Right panel:} Corresponding parallel efficiency. Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, a good parallel efficiency (60\%) is achieved when increasing the thread count from 1 (1 node) to 128 (8 nodes) even on this relatively small test case. The dashed line indicates the efficiency when running on one single node but using all the physical and virtual cores (hyper-threading). As these CPUs only have one FPU per core, the benefit from hyper-threading is limited to a 20\% improvement when going from 16 cores to 32 hyperthreads.
\label{fig:cosma}}
\end{figure*}

\subsection{x86 architecture: SuperMUC}

For our next test, we ran \swift on the SuperMUC x86 phase~1 thin nodes\footnote{\url{https://www.lrz.de/services/compute/supermuc/systemdescription/}} located at the Leibniz Supercomputing Center in Garching near Munich, currently ranked 23rd in the Top500 list\footnote{\url{http://www.top500.org/list/2015/11/}}.
This system consists of 9\,216 nodes, each with 2 Intel Sandy Bridge-EP Xeon E5-2680 8C\footnote{\url{http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-(20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI)}} CPUs clocked at $2.7~\rm{GHz}$. Each 16-core node has $32~\rm{GByte}$ of RAM. The nodes are split into 18 ``islands'' of 512 nodes within which communications are handled via an Infiniband FDR10 non-blocking Tree. These islands are then connected using a 4:1 Pruned Tree. This system is similar in nature to the COSMA-5 system used in the previous set of tests but is much larger, allowing us to test scalability at a significantly larger scale. The code was compiled with the Intel compiler version \textsc{2015.5.223} and linked to the Intel MPI library version \textsc{5.1.2.150} and METIS library version \textsc{5.0.2}.

The simulation setup with $800^3$ particles was run using 16 to 2\,048 nodes (4 islands), and the results of this strong scaling test are shown in Fig.~\ref{fig:superMUC}. For this test, we used one MPI rank per node and 16 threads per node, i.e.~one thread per physical core.

\begin{figure*}
\centering
\includegraphics[width=\textwidth]{Figures/scalingSuperMUC}
\caption{Strong scaling test on the SuperMUC phase~1 machine (see text for hardware description). \textit{Left panel:} Code speed-up. \textit{Right panel:} Corresponding parallel efficiency. Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, an almost perfect parallel efficiency is achieved when increasing the node count from 16 (256 cores) to 2\,048 (32\,768 cores).
\label{fig:superMUC}}
\end{figure*}

\subsection{Blue Gene architecture: JUQUEEN}

For our last set of tests, we ran \swift on the JUQUEEN IBM Blue Gene/Q system\footnote{\url{http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/Configuration/Configuration_node.html}} located at the J\"ulich Supercomputing Center, currently ranked 11th in the Top500 list. This system consists of 28\,672 nodes with an IBM PowerPC A2 processor running at $1.6~\rm{GHz}$ and $16~\rm{GByte}$ of RAM each. Of notable interest is the presence of two floating-point units per compute core. The system is composed of 28 racks, each containing 1\,024 nodes. The network uses a 5D torus to link all the racks. This system is larger than the SuperMUC supercomputer described above and uses a completely different processor and instruction set. We use it here to demonstrate that our results are not dependent on the hardware being used. The code was compiled with the IBM XL compiler version \textsc{30.73.0.13} and linked to the corresponding MPI library and the METIS library version \textsc{4.0.2}.

The simulation setup with $600^3$ particles was first run using 512 nodes with one MPI rank per node and varying only the number of threads per node. The results of this test are shown in Fig.~\ref{fig:JUQUEEN1}. We later repeated the test, this time varying the number of nodes from 32 to 8\,192 (8 racks). For this test, we used one MPI rank per node and 32 threads per node, i.e.~two threads per physical core. The results of this strong scaling test are shown in Fig.~\ref{fig:JUQUEEN2}.

\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/scalingInNode}
\caption{Strong scaling test of the hybrid component of the code. The same calculation is performed on 512 nodes of the JUQUEEN Blue Gene supercomputer (see text for hardware description) using a single MPI rank per node and varying only the number of threads per node.
The code displays excellent scaling even when all the cores and hardware multi-threads are in use.
\label{fig:JUQUEEN1}}
\end{figure}

\begin{figure*}
\centering
\includegraphics[width=\textwidth]{Figures/scalingBlueGene}
\caption{Strong scaling test on the JUQUEEN Blue Gene machine (see text for hardware description). \textit{Left panel:} Code speed-up. \textit{Right panel:} Corresponding parallel efficiency. Using 32 threads per node (2 per physical core) with one MPI rank per node, a parallel efficiency of more than $60\%$ is achieved when increasing the node count from 32 (512 cores) to 8\,192 (131\,072 cores). On 8\,192 nodes there are fewer than 27\,000 particles per node and only a few hundred tasks, making the whole problem extremely hard to load-balance effectively.
\label{fig:JUQUEEN2}}
\end{figure*}

%#####################################################################################################

\section{Discussion \& conclusions}

The strong scaling results presented in the previous sections on three different machines demonstrate the ability of our framework to scale both on small commodity machines, thanks to the use of task-based parallelism at the node level, and on the largest machines (Tier-0 systems) currently available, thanks to the task-based domain distribution and asynchronous communication schemes. We would like to emphasize that these results were obtained for a realistic test case without any micro-level optimization or explicit vectorization.

Excellent strong scaling is also achieved when increasing the number of threads per node (i.e.~per MPI rank, see Fig.~\ref{fig:JUQUEEN1}), demonstrating that modelling the (asynchronous) MPI communications as tasks within our framework is not a bottleneck. A common belief in HPC is that the number of MPI communications between nodes should be kept to a minimum to optimize the efficiency of the calculation. Our approach does exactly the opposite, with a large number of point-to-point communications between pairs of nodes occurring over the course of a time-step. For example, on the SuperMUC machine with 32 nodes (512 cores), each MPI rank contains approximately $1.6\times10^7$ particles in $2.5\times10^5$ cells. \swift will generate around $58\,000$ point-to-point asynchronous MPI communications (a pair of \texttt{send} and \texttt{recv} tasks) {\em per node} and {\em per timestep}. Each one of these communications involves, on average, no more than 6\,kB of data. Such a huge number of small messages is discouraged by most practitioners, but seems to work well in practice. Dispatching communications over the course of the calculation, and not in short bursts as is commonly done, may also help lower the load on the network.

One time-step on $8\,192$ nodes of the JUQUEEN machine takes $63~\rm{ms}$ of wall-clock time. All the queueing of the tasks, the communications, and the execution of the tasks take place within that short amount of time. Our framework can therefore load-balance a calculation over $2.6\times10^5$ threads with remarkable efficiency.

We emphasize, as was previously demonstrated in \cite{ref:Gonnet2015}, that \swift is also much faster than the \gadget code \cite{Springel2005}, the \emph{de facto} standard in the field of particle-based cosmological simulations. The simulation setup that was run on the COSMA-5 system takes $2.9~\rm{s}$ of wall-clock time per time-step on $256$ cores using \swift, whilst the default \gadget code on exactly the same setup with the same number of cores requires $32~\rm{s}$.
The excellent scaling performance of \swift allows us to push this number down further by simply increasing the number of cores, whilst \gadget reaches its peak speed (for this problem) at around 300 cores and stops scaling beyond that. This unprecedented scaling ability, combined with future work on the vectorization of the calculations within each task, will hopefully make \swift an important tool for future simulations in cosmology and help push the entire field to a new level.

These results, which are in no way restricted to astrophysical simulations, provide a compelling argument for moving away from the traditional branch-and-bound paradigm of both shared- and distributed-memory programming using synchronous MPI and OpenMP. Although fully asynchronous methods, due to their somewhat anarchic nature, may seem more difficult to control, they are conceptually simple and easy to implement\footnote{The \swift source code consists of less than 21\,K lines of code, of which only roughly one tenth is needed to implement the task-based parallelism and asynchronous communication.}. The real cost of using a task-based approach comes from having to rethink the entire computation to fit the task-based setting. This may, as in our specific case, lead to completely rethinking the underlying algorithms and reimplementing a code from scratch. Given our results to date, however, we do not believe we will regret this investment.

\swift, its documentation, and the test cases presented in this paper are all available at the address \web.

%#####################################################################################################

\section{Acknowledgements}

This work would not have been possible without Lydia Heck's help and expertise. We acknowledge the help of Tom Theuns, James Willis, Bert Vandenbroucke, Angus Lepper, and other contributors to the \swift code. \\
We thank Heinrich Bockhorst and Stephen Blair-Chappell from {\sc intel} as well as Dirk Brommel from the J\"ulich Supercomputing Center and Nikolay J. Hammer from the Leibniz Rechenzentrum for their help at various stages of this project.\\
This work used the DiRAC Data Centric system at Durham University, operated by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (\url{www.dirac.ac.uk}). This equipment was funded by BIS National E-infrastructure capital grant ST/K00042X/1, STFC capital grant ST/H008519/1, STFC DiRAC Operations grant ST/K003267/1, and Durham University. DiRAC is part of the National E-Infrastructure. This work was supported by the Science and Technology Facilities Council grant ST/F001166/1, by the European Research Council under the European Union's ERC Grant agreement 267291 ``Cosmiway'', and by {\sc intel} through the establishment of the ICC as an {\sc intel} parallel computing centre (IPCC).

\bibliographystyle{abbrv}
\bibliography{biblio}

\end{document}