%% 7th_SPHERIC_paper_template.tex %% V1.0 %% updated 08/03/2013 %% courtesy of nicolas.grenier@ec-nantes %% courtesy of pierre.maruzewski@epfl.ch %% %% NOTE: This text file uses MS Windows line feed conventions. When (human) %% reading this file on other platforms, you may have to use a text %% editor that can handle lines terminated by the MS Windows line feed %% characters (0x0D 0x0A). % Note that the a4paper option is mainly intended so that authors in % countries using A4 can easily print to A4 and see how their papers will % look in print. Authors are encouraged to use U.S. % % Also note that the "draftcls" or "draftclsnofoot", not "draft", option % should be used if it is desired that the figures are to be displayed in % draft mode. % % This paper can be formatted using the peerreviewca % (instead of conference) mode. \documentclass[a4paper,conference]{IEEEtran} % Dimensions of margins (DO NOT MODIFY THEM) \setlength{\textheight} {23.4cm}% \setlength{\topmargin} {-0.8cm}% \setlength{\headheight} {0.6cm}% \setlength{\headsep} {0.9cm}% % some very useful LaTeX packages include: \usepackage{cite} % Written by Donald Arseneau \usepackage{graphicx} % Written by David Carlisle and Sebastian Rahtz \usepackage{subfigure} % Written by Steven Douglas Cochran % *** Do not adjust lengths that control margins, column widths, etc. *** % *** Do not use packages that alter fonts (such as pslatex). *** % There should be no need to do such things with IEEEtran.cls V1.6 and later. % correct bad hyphenation here \hyphenation{op-tical net-works semi-conduc-tor IEEEtran} \usepackage{fancyheadings} \pagestyle{fancy} \lhead{$8^{th}$ international SPHERIC workshop} \rhead{Trondheim, Norway, June 4-6, 2013} \cfoot{} % to avoid page numbering %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Set options for the listings package %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{listings} \lstset{% language=C, basicstyle=\scriptsize\tt, numbers=left, numberstyle=\tiny } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Some useful commands %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{amsfonts} \usepackage{amsmath} \usepackage{epsfig} \usepackage{xspace} \usepackage{flushend} \newcommand{\oh}[1] {\mbox{$ {\mathcal O}( #1 ) $}} \newcommand{\eqn}[1] {(\ref{eqn:#1})} \newcommand{\fig}[1] {Figure~\ref{fig:#1}} \newcommand{\tabl}[1] {Table~\ref{tab:#1}} \newcommand{\tab}[1] {Table~\ref{tab:#1}} \newcommand{\figs}[1] {Figures~\ref{fig:#1}} \newcommand{\cuda} {{\sc cuda}\xspace} \newcommand{\namd} {{\sc namd}\xspace} \newcommand{\acemd} {{\sc acemd}\xspace} \newcommand{\fenzi} {{\sc fenzi}\xspace} \newcommand{\mdcore} {{\tt mdcore}\xspace} \newcommand{\swift} {{\sc swift}\xspace} \newcommand{\bsf}[1] {\textbf{\textsf{#1}}} \begin{document} % paper title \title{{\sc swift}: Fast algorithms for multi-resolution SPH on multi-core architectures} % author names and affiliations \author{ \IEEEauthorblockN{Pedro Gonnet\IEEEauthorrefmark{1}, Matthieu Schaller\IEEEauthorrefmark{2}, Tom Theuns\IEEEauthorrefmark{2}\IEEEauthorrefmark{3}, Aidan B. G. 
Chalk\IEEEauthorrefmark{1}}
\IEEEauthorblockA{\IEEEauthorrefmark{1}School of Engineering and Computing Sciences,\\
Durham University,\\
DH1 3LE Durham, UK\\
\{pedro.gonnet,aidan.chalk\}@durham.ac.uk \\ ~\\}
\IEEEauthorblockA{\IEEEauthorrefmark{2}Institute for Computational Cosmology,\\
Durham University, \\
DH1 3LE Durham, UK\\
\{matthieu.schaller,tom.theuns\}@durham.ac.uk \\ ~\\}
\IEEEauthorblockA{\IEEEauthorrefmark{3}Department of Physics, \\
University of Antwerp, Groenenborgerlaan 171, \\
B-2020 Antwerp, Belgium}
}

% use only for invited papers
%\specialpapernotice{(Invited Paper)}

% make the title area
\maketitle

\begin{abstract}
This paper describes a novel approach to neighbour-finding in Smoothed Particle Hydrodynamics (SPH) simulations with a large dynamic range in smoothing length. This approach is based on hierarchical cell decompositions, sorted interactions, and a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on shared-memory parallel architectures such as multi-cores.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Introduction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

Over the past few years, due to the physical limits on the speed of individual processor cores, computers have stopped getting {\em faster} and have instead been getting {\em more parallel}. This increase in parallelism comes mainly in the form of {\em multi-core} computers, in which the number of cores can be expected to continue growing exponentially, e.g.~following Moore's law, much in the same way processor speeds did up until a few years ago.

The predominant paradigm for parallel computing is currently distributed-memory parallelism using MPI (Message Passing Interface) \cite{ref:Snir1998}, in which large simulations are generally parallelized by means of data decompositions, i.e.~by assigning each node or core a portion of the data on which to work. The cores execute the same code in parallel, each on its own part of the data, intermittently exchanging data with the others. The amount of {\em computation} local to a node is then proportional to the amount of data it contains, e.g.~its volume, while the amount of {\em communication} is proportional to the amount of computation spanning two or more nodes, e.g.~its surface. If the number of nodes increases, or smaller problems are considered, the surface-to-volume ratio, i.e.~the ratio of communication to computation, grows, and the time spent on communication will increasingly dominate the entire simulation, reducing scaling and parallel efficiency. Assuming the individual cores do not get any faster, this means that small simulations, for which the maximum degree of parallelism has already been reached, will never become any faster. In order to speed up small simulations, or to continue scaling for large simulations, new approaches to parallelizing the computation need to be considered.

With the above in mind, we describe, in the following, a reformulation of the underlying algorithms for Smoothed Particle Hydrodynamics (SPH) simulations which uses asynchronous task-based shared-memory parallelism to achieve better parallel scaling and efficiency on multi-core architectures.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Algorithms, i.e. decomposition, interactions, and parallelization
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Algorithms}

The interactions in compressible gas dynamics using SPH are computed in two distinct stages that are evaluated separately:
\begin{enumerate}
  \item {\em Density} computation: For each particle $p_i$, loop over all particles $p_j$ within $h_i$ of $p_i$ and compute the particle densities.
  \item {\em Force} computation: For each particle $p_i$, loop over all particles $p_j$ within $\max\{h_i,h_j\}$ of $p_i$ and compute the particle forces.
\end{enumerate}
The identification of these interacting particle pairs incurs the main computational cost, as will be shown in the following sections, and therefore also presents the main challenge in implementing efficient SPH simulations.

\subsection{Tree-based approach}

In its simplest formulation, all particles in an SPH simulation have a constant smoothing length $h$. In such a setup, finding the particles in range of any other particle is similar to Molecular Dynamics simulations, in which all particles interact within a constant cutoff radius, and approaches used in the latter, e.g.~cell-linked lists \cite{ref:Allen1989} or Verlet lists \cite{ref:Verlet1967}, can be applied \cite{ref:Dominguez2011,ref:Viccione2008}.

The neighbour-finding problem becomes more interesting, or difficult, in SPH simulations with variable smoothing lengths, i.e.~in which each particle has its own smoothing length $h_i$, with ranges spanning several orders of magnitude. In such cases, e.g.~in astrophysics simulations \cite{ref:Gingold1977}, the above-mentioned approaches cease to work efficiently. Such codes therefore usually rely on {\em trees} for neighbour finding \cite{ref:Hernquist1989,ref:Springel2005,ref:Wadsley2004}, i.e.~$k$-d trees \cite{ref:Bentley1975} or octrees \cite{ref:Meagher1982} are used to decompose the simulation space. The particle interactions are then computed by traversing the list of particles and searching for their neighbours in the tree.

Using such trees, it is in principle trivial to parallelize the neighbour finding and the actual computation on shared-memory computers, e.g.~each thread walks the tree for a different particle, identifies its neighbours and computes the densities and/or the second derivatives of the physical quantities for that particle.

Despite its simple and elegant formulation, the tree-based approach to neighbour-finding has three main problems:
\begin{itemize}
  \item Computational efficiency: Finding all neighbours of a particle in the tree is, on average, in \oh{\log N}, with worst-case behaviour in \oh{N^{2/3}} \cite{ref:Lee1977}, i.e.~the cost per particle grows with the total number of particles $N$.
  \item Cache efficiency: When searching for the neighbours of a given particle, the data of all potential neighbours, which may not be contiguous in memory, is traversed separately.
  \item Symmetry: The parallel tree search cannot exploit symmetry, i.e.~a pair $p_i$ and $p_j$ will always be found twice, once when walking the tree for each particle.
\end{itemize}
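To make these points concrete, the following is a minimal sketch of such a per-particle tree walk, written here with a hypothetical octree node layout that is not taken from any of the codes cited above. Every query descends the tree independently, the particle data visited is scattered across memory, and a pair $p_i$, $p_j$ is found once when querying for $p_i$ and a second time when querying for $p_j$:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
/* Illustrative sketch only; the node layout is hypothetical. */
#include <math.h>

struct part { double x[3], h; };

struct node {
  double centre[3], half_width;  /* bounding box of the node */
  struct node *children[8];      /* all NULL for leaf nodes */
  struct part **parts;           /* particles, stored in leaves */
  int count;
  };

/* Squared distance between a point and the node's bounding box. */
double box_dist2 ( struct node *n , double *x ) {
  double d2 = 0.0;
  for ( int k = 0 ; k < 3 ; k++ ) {
    double d = fabs( x[k] - n->centre[k] ) - n->half_width;
    if ( d > 0.0 ) d2 += d*d;
    }
  return d2;
  }

/* Recursively collect all particles within h of x. */
void tree_search ( struct node *n , double *x , double h ,
                   void (*found)( struct part * ) ) {
  if ( box_dist2( n , x ) > h*h )    /* node entirely out of range */
    return;
  if ( n->children[0] == NULL ) {    /* leaf: scan its particles */
    for ( int k = 0 ; k < n->count ; k++ ) {
      double r2 = 0.0;
      for ( int j = 0 ; j < 3 ; j++ ) {
        double d = n->parts[k]->x[j] - x[j];
        r2 += d*d;
        }
      if ( r2 < h*h ) found( n->parts[k] );
      }
    }
  else
    for ( int k = 0 ; k < 8 ; k++ )  /* otherwise recurse */
      if ( n->children[k] != NULL )
        tree_search( n->children[k] , x , h , found );
  }
\end{lstlisting}
\end{minipage}\end{center}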
These problems are all inherently linked to the use of spatial trees, or more specifically their traversal, for neighbour-finding. In the following, we therefore proceed differently, using a hierarchical cell decomposition and computing the interactions over pairs of cells, thus avoiding the above-mentioned problems.

\subsection{Spatial decomposition}

If $h_\mathsf{max} := \max_i h_i$ is the maximum smoothing length of any particle in the simulation, we start by splitting the simulation domain into rectangular cells of edge length greater than or equal to $h_\mathsf{max}$. Given such an initial decomposition, we then generate a list of cell {\em self-interactions}, which contains all non-empty cells in the grid. This list of interactions is then extended by the cell {\em pair-interactions}, i.e.~a list of all non-empty cell pairs sharing either a face, an edge, or a corner. For periodic domains, cell pair-interactions also need to be specified for cells neighbouring each other across the periodic boundaries. These self- and pair-interactions encode, conceptually at least, the evaluation of the interactions between all particles in the same cell, or between all particle pairs spanning a pair of cells, respectively. In this first coarse decomposition, if a particle $p_j$ is within range of a particle $p_i$, both will either be in the same cell, for which a cell self-interaction has been specified, or in neighbouring cells, for which a cell pair-interaction has been specified (see \fig{InitialDecomp}).

Even in the best case, i.e.~if each cell contains only particles with a smoothing length equal to the cell edge length, if for any particle $p_i$ we inspect each particle $p_j$ in the same and neighbouring cells, only roughly 16\% of the inspected particles $p_j$ will actually be within range of $p_i$ \cite{ref:Gonnet2007}. This ratio is even worse if the cells contain particles whose smoothing length is less than the cell edge length. We therefore recursively refine the cell decomposition, bisecting each cell along all spatial dimensions if (a)~the cell contains more than some minimal number of particles, and (b)~the smoothing length of a reasonable fraction of the particles within the cell is less than half the cell edge length (a sketch of this procedure is given below). After the cells have been split, the cell self-interaction of each cell can be split up into the self-interactions of its sub-cells and the pair-interactions between them. Likewise, the cell pair-interaction between two cells that have been split can itself be split up into the pair-interactions of the sub-cells spanning the original pair boundary if, and only if, all particles in both cells have a smoothing length of less than half the cell edge length (see \fig{SplitCell}).

If the cells, self-interactions, and pair-interactions are split in this way, then whenever two particles are within range of each other, they will either (a)~share a cell for which a cell self-interaction is defined, or (b)~be located in two cells which share a cell pair-interaction. In order to identify all the particles within range of each other, it is therefore sufficient to traverse the list of self-interactions and pair-interactions and inspect the particle pairs therein.
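This recursive refinement can be written as a short recursive function. The following is a minimal sketch under simplified assumptions: the {\tt cell} structure shown is illustrative rather than \swift's actual data layout, the routine {\tt cell\_bisect()}, which creates the $2^3$ sub-cells and distributes the particles among them, is assumed to exist, and the two threshold values correspond to those quoted in the Validation section below:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
/* Illustrative thresholds, cf. the Validation section. */
#define cell_minparts  400
#define cell_splitfrac 0.875

struct part { double x[3], h; };

struct cell {
  struct part *parts;       /* particles in this cell */
  int count;                /* number of particles */
  double edge;              /* cell edge length */
  struct cell *progeny[8];  /* sub-cells, NULL if not split */
  };

/* Assumed to exist: bisect the cell along all three dimensions
   and distribute its particles among the eight sub-cells. */
void cell_bisect ( struct cell *c );

/* Recursively refine the cell decomposition. */
void cell_split ( struct cell *c ) {
  int k, nr_small = 0;
  /* Criterion (a): enough particles to be worth splitting. */
  if ( c->count <= cell_minparts )
    return;
  /* Criterion (b): a sufficient fraction of the particles has a
     smoothing length of less than half the cell edge length. */
  for ( k = 0 ; k < c->count ; k++ )
    if ( c->parts[k].h < 0.5*c->edge )
      nr_small += 1;
  if ( nr_small < cell_splitfrac*c->count )
    return;
  /* Split this cell and recurse over its sub-cells. */
  cell_bisect( c );
  for ( k = 0 ; k < 8 ; k++ )
    if ( c->progeny[k] != NULL )
      cell_split( c->progeny[k] );
  }
\end{lstlisting}
\end{minipage}\end{center}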
\begin{figure}
\centerline{\epsfig{file=figures/InitialDecomp.pdf,width=0.5\columnwidth}}
\caption{Initial spatial decomposition: The simulation volume is divided into cells of edge length greater than or equal to the largest smoothing length in the system. All neighbours of any given particle (small red circle) within that particle's smoothing length (large red circle) are guaranteed to lie either within that particle's own cell (green) or in the directly adjacent cells (orange).}
\label{fig:InitialDecomp}
\end{figure}

\begin{figure}
\centerline{{\scriptsize\bsf A} \raisebox{-0.85\height}{\epsfig{file=figures/SplitCell.pdf,width=0.44\columnwidth}}}
\vspace{2ex}
\centerline{{\scriptsize\bsf B} \raisebox{-0.85\height}{\epsfig{file=figures/SplitPair.pdf,width=0.8\columnwidth}}}
\caption{({\bsf A}) Large cells can be split, and their self-interaction replaced by the self- and pair-interactions of their sub-cells. ({\bsf B}) If all particles in a pair of interacting cells have a smoothing length of less than or equal to half the cell edge length, both cells can be split, and their pair-interaction replaced by the pair-interactions of the neighbouring sub-cells across the interface.}
\label{fig:SplitCell}
\label{fig:SplitPair}
\end{figure}

\subsection{Particle interactions}

The interactions between all particle pairs within a given cell, i.e.~the cell's self-interaction, can be computed by means of a double {\tt for}-loop over the cell's particle array. The algorithm, in C-like pseudo-code, can be written as follows:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
for ( i = 0 ; i < count-1 ; i++ ) {
  for ( j = i+1 ; j < count ; j++ ) {
    rij = ||parts[i] - parts[j]||.
    if ( rij < h[i] || rij < h[j] ) {
      compute interaction.
      }
    }
  }
\end{lstlisting}
\end{minipage}\end{center}
\noindent where {\tt count} is the number of particles in the cell and {\tt parts} and {\tt h} refer to the array of those particles and the array of their smoothing lengths, respectively. The interactions between all particles in a pair of cells can be computed similarly, e.g.:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
for ( i = 0 ; i < count_i ; i++ ) {
  for ( j = 0 ; j < count_j ; j++ ) {
    rij = ||parts_i[i] - parts_j[j]||.
    if ( rij < h_i[i] || rij < h_j[j] ) {
      compute interaction.
      }
    }
  }
\end{lstlisting}
\end{minipage}\end{center}
\noindent where {\tt count\_i} and {\tt count\_j} refer to the number of particles in each cell, and {\tt parts\_i} and {\tt parts\_j}, and {\tt h\_i} and {\tt h\_j}, refer to the particles of each cell and their smoothing lengths, respectively.

As described in \cite{ref:Gonnet2007}, using this naive double {\tt for}-loop, for uniform particle distributions only $33.5\%$, $16.2\%$, and $3.6\%$ of all particle pairs between cells sharing a common face, edge, or corner, respectively, will be within range of each other, leading to an excessive number of spurious pairwise distance evaluations (line~3). We therefore use the sorted cell interactions described in \cite{ref:Gonnet2007} and \cite{ref:Gonnet2013}, yet with some minor modifications, since the original algorithm was designed for systems in which the smoothing lengths of all particles are equal: We first sort the particles in both cells along the vector joining the centers of the two cells and then loop over the particles $p_i$ on the left, interacting each of them with the sorted particles $p_j$ on the right which are within $h_i$ {\em along the cell pair axis}. The same procedure is repeated for each particle $p_j$ on the right, interacting it with each particle $p_i$ on the left which is within $h_j$, {\em but not within} $h_i$, along the cell pair axis (see \fig{SortedInteractions}).
The resulting algorithm, in C-like pseudo-code, can be written as follows:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
for ( i = 0 ; i < count_i ; i++ ) {
  for ( jj = 0 ; jj < count_j ; jj++ ) {
    j = ind_j[jj];
    if ( r_i[i] + h_i[i] < r_j[j] )
      break;
    rij = ||parts_i[i] - parts_j[j]||.
    if ( rij < h_i[i] ) {
      compute interaction.
      }
    }
  }
for ( j = 0 ; j < count_j ; j++ ) {
  for ( ii = count_i-1 ; ii >= 0 ; ii-- ) {
    i = ind_i[ii];
    if ( r_i[i] < r_j[j] - h_j[j] )
      break;
    rij = ||parts_i[i] - parts_j[j]||.
    if ( rij < h_j[j] && rij > h_i[i] ) {
      compute interaction.
      }
    }
  }
\end{lstlisting}
\end{minipage}\end{center}
\noindent where {\tt r\_i} and {\tt r\_j} contain the positions of the particles of both cells along the cell pair axis, and {\tt ind\_i} and {\tt ind\_j} contain the particle indices sorted with respect to these positions, respectively.

This may seem like quite a bit of sorting, but if a cell has been split and its sub-cells have been sorted, the sorted indices of the higher-level cell can be constructed by shifting and merging the indices of its eight sub-cells. This reduces the \oh{n\log{n}} sorting to an \oh{n} merge. Furthermore, instead of sorting the particles every time we compute the pairwise interactions between two cells, we can pre-compute the sorted indices along the 13 possible cell-pair axes and store them for each cell, as is done in \cite{ref:Gonnet2013}. The sorted indices can also be re-used over several time steps, until the hierarchical cell decomposition needs to be re-computed.

\begin{figure}
\centerline{\epsfig{file=figures/SortedInteractions.pdf,width=0.8\columnwidth}}
\caption{Sorted cell pair-interactions. ({\bsf A}) Starting from a pair of neighbouring cells, ({\bsf B}) the particles from both cells are projected onto the axis joining the centers of the two cells. The particles on the left (blue) and right (orange) are then sorted in descending and ascending order, respectively. Each particle on the left is then only interacted with the particles on the right within the cutoff radius along the cell axis.}
\label{fig:SortedInteractions}
\end{figure}

\subsection{Parallel implementation}

OpenMP \cite{ref:Dagum1998} is arguably the best-known paradigm for shared-memory, or multi-threaded, parallelism. In OpenMP, compiler annotations are used to describe if and when specific loops or portions of the code can be executed in parallel. When such a parallel section, e.g.~a parallel loop, is encountered, its iterations are split statically or dynamically over the available threads, each executing on a single core, and the threads join again at the end of the parallel section. Unfortunately, this can lead to a lot of inefficient branch-and-bound type operations, which generally lead to low performance and bad scaling on even moderate numbers of cores (see \fig{OMPScaling}). Furthermore, this form of shared-memory parallelism provides no implicit mechanism to avoid or handle concurrency problems, e.g.~two threads attempting to modify the same data at the same time, or data dependencies between them. These must be implemented explicitly using either redundancy, barriers, critical sections, or atomic memory operations, which can further degrade parallel performance.

\begin{figure}
\centerline{\epsfig{file=figures/OMPScaling.pdf,width=0.8\columnwidth}}
\caption{Branch-and-bound parallelism as is commonly used in OpenMP. The horizontal arrows indicate the program flow over time, and the branching arrows indicate a parallel section.
The dotted vertical bars are the synchronization points at the end of each such section. Parallel efficiency is lost due to two factors: the grey shaded areas along the main horizontal arrow indicate parts of the program that do not execute in parallel and restrict the maximum parallel efficiency, e.g.~as described by Amdahl's law, and the red areas indicate the difference between the fastest and slowest threads in each parallel block, i.e.~the time lost to thread synchronization.}
\label{fig:OMPScaling}
\end{figure}

In order to better exploit shared-memory parallelism, we have to change the underlying paradigm, i.e.~instead of annotating an essentially serial computation with parallel bits, it is preferable to describe the whole computation in a way that is inherently parallelizable. One such approach is {\em task-based parallelism}, in which the computation is divided into a number of inter-dependent computational tasks, which are then dynamically allocated to a number of processors. In order to ensure that the tasks are executed in the right order, e.g.~that data needed by one task is only used once it has been produced by another task, and that no two tasks update the same data at the same time, constraints are specified and enforced by the task scheduler.

Several middlewares providing such task-based parallelism exist, e.g.~Cilk \cite{ref:Blumofe1995}, QUARK \cite{ref:QUARK}, StarPU \cite{ref:Augonnet2011}, SMP~Superscalar \cite{ref:SMPSuperscalar}, OpenMP~3.0 \cite{ref:Duran2009}, and Intel's TBB \cite{ref:Reinders2007}. These implementations allow for the specification of individual tasks along with their {\em dependencies}, i.e.~hierarchical relationships specifying which tasks must have completed before another task can be executed. We differ from these approaches, however, in that we introduce the concept of {\em conflicts} between tasks. Conflicts occur when two tasks operate on the same data, but the order in which these operations must occur is not defined. In previous task-based models, conflicts can be modelled as dependencies, yet this introduces an artificial ordering between the tasks and imposes unnecessary constraints on the task scheduler (e.g.~mutual and non-mutual interactions in \cite{ref:Ltaief2012}). These conflicts can instead be modelled using exclusive {\em locks} on shared resources, i.e.~a task will only be scheduled if it can obtain an exclusive lock on, and thus exclusive access to, potentially shared data.

The particle interactions described in the previous subsection lead to three basic task types:
%
\begin{itemize}
  \item Cell {\em sorting}, in which the particles in a given cell are sorted with respect to their position along a one-dimensional axis,
  \item Cell {\em self-interaction}, in which all the particles of a given cell are interacted with all the other particles within the same cell,
  \item Cell {\em pair-interaction}, in which the interactions for all particle pairs spanning a pair of cells are computed.
\end{itemize}
%
The self-interaction and pair-interaction tasks exist in two flavors, one for the density computation and one for the force computation. Each pair-interaction task requires the sorted indices of the particles in each cell provided by the sorting tasks. Since the tasks are restricted to operating on the data of a single cell, or pair of cells, two tasks conflict if they operate on overlapping sets of cells. Due to the hierarchical nature of the spatial decomposition, two tasks also conflict if the cells of one task are sub-cells of the cells of the other.
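At run time, these conflicts are most easily resolved with per-cell locks: before a task is handed to a thread, the scheduler attempts to lock the cell or cells on which the task operates, and skips the task if any of the locks cannot be obtained. The following is a minimal sketch of this idea using {\tt pthread} mutexes and illustrative structure names, i.e.~not necessarily \swift's actual bookkeeping; handling the parent/sub-cell conflicts additionally requires, e.g., hold counters on the ancestor cells, which are omitted here for brevity:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
/* Illustrative sketch of conflict handling via cell locks. */
#include <pthread.h>

struct cell {
  pthread_mutex_t lock;   /* exclusive lock on this cell's data */
  /* ... particle data, sub-cells, etc. ... */
  };

struct task {
  struct cell *ci, *cj;   /* cells used; cj is NULL for
                             self-interaction and sort tasks */
  /* ... task type, dependency counters, etc. ... */
  };

/* Try to get exclusive access to a task's cells. Returns 1 on
   success, or 0 if it conflicts with a task already running. */
int task_trylock ( struct task *t ) {
  if ( pthread_mutex_trylock( &t->ci->lock ) != 0 )
    return 0;
  if ( t->cj != NULL &&
       pthread_mutex_trylock( &t->cj->lock ) != 0 ) {
    pthread_mutex_unlock( &t->ci->lock );   /* roll back */
    return 0;
    }
  return 1;
  }

/* Release the cells once the task has completed. */
void task_unlock ( struct task *t ) {
  pthread_mutex_unlock( &t->ci->lock );
  if ( t->cj != NULL )
    pthread_mutex_unlock( &t->cj->lock );
  }
\end{lstlisting}
\end{minipage}\end{center}
A thread that fails to acquire the cells of a task in this way simply leaves the task in the queue and tries the next available one, so no thread ever blocks on a lock.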
Since the SPH computations have two phases, i.e.~density and force computation, we introduce a {\em ghost} task in between for each cell. This ghost task depends on all the density computations for a given cell, and, in turn, all force computations involving that cell depend on its ghost task. Using this mechanism, we can enforce that all density computations for a set of particles have completed before these densities are used in the force computations.

The dependencies and conflicts between tasks can thus be formulated as follows:
\begin{itemize}
  \item A cell sorting task on a cell with sub-cells depends on the sorting tasks of all its sub-cells.
  \item A cell pair-interaction depends on the cell sorting tasks of both its cells.
  \item Cell pair-interaction and cell self-interaction tasks operating on overlapping sets of cells or sub-cells conflict with each other.
  \item The ghost task of each cell depends on all the density cell pair-interactions and self-interactions which involve the particles in that cell.
  \item The ghost task of each cell depends on the ghost tasks of its sub-cells.
  \item The force cell pair-interaction and self-interaction tasks depend on the ghost tasks of the cells on which they operate.
\end{itemize}
\noindent These task dependencies are illustrated in \fig{Hierarchy2}, and a sketch of how they can be encoded is given at the end of this subsection.

This task decomposition has significant advantages over the use of spatial trees: First of all, the cost of identifying all particles in range of a given particle does not depend on the total number of particles. Furthermore, the particle interactions in each task are computed symmetrically, i.e.~each particle pair is identified only once for each interaction type. The sorted particle indices can be re-used for both the density and force computations, and over several time steps, thus reducing the computational cost even further. Finally, if the particles are stored grouped by cell, each task only involves accessing and modifying a contained and contiguous region of memory, thus potentially improving cache re-use \cite{ref:Fomin2011}.
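In code, these dependencies can be collected while the task list is built. The sketch below uses a hypothetical helper {\tt task\_addunlock(ta,tb)}, meaning that task {\tt ta} must complete before task {\tt tb} may run, and shows how the tasks of a single cell pair might be wired up; the names and interface are illustrative and not necessarily those used in \swift:
\begin{center}\begin{minipage}{0.9\columnwidth}
\begin{lstlisting}
/* Illustrative sketch of encoding the dependencies above. */
struct task;

/* Assumed to exist: record that task ta must complete
   before task tb is allowed to run. */
void task_addunlock ( struct task *ta , struct task *tb );

/* Wire up the tasks associated with a single pair of
   cells ci and cj. */
void wire_pair ( struct task *sort_i , struct task *sort_j ,
                 struct task *density_pair ,
                 struct task *ghost_i , struct task *ghost_j ,
                 struct task *force_pair ) {
  /* The density pair-interaction needs both cells sorted. */
  task_addunlock( sort_i , density_pair );
  task_addunlock( sort_j , density_pair );
  /* The ghost tasks wait for all density computations
     involving their cell. */
  task_addunlock( density_pair , ghost_i );
  task_addunlock( density_pair , ghost_j );
  /* The force pair-interaction may only run once the
     densities in both cells are complete. */
  task_addunlock( ghost_i , force_pair );
  task_addunlock( ghost_j , force_pair );
  }
\end{lstlisting}
\end{minipage}\end{center}
The remaining dependencies listed above, e.g.~between the ghost or sort tasks of a cell and those of its sub-cells, can be set up in the same way.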
\begin{figure}
\centerline{\epsfig{file=figures/Hierarchy2.pdf,width=0.4\columnwidth}}
\caption{Task dependencies and conflicts: Arrows indicate the dependencies between different task types, i.e.~an arrow from task A to task B indicates that A depends on B. Horizontal dashed lines between tasks indicate conflicts, i.e.~the two tasks cannot be executed concurrently. Each sort task (circles) depends only on the sort tasks of its sub-cells. The pair-interactions (rectangles) for the particle density computation depend on the sort tasks of the respective cells, whereas self-interaction tasks (squares) for the density computation do not, as they do not require sorting. Self- and pair-interactions on overlapping cells (same colour) conflict with each other. The ghost task of each cell (triangles) depends on the self- and pair-interaction density tasks. The self- and pair-interaction tasks for the force computation, finally, depend on the ghost tasks of the respective cells.}
\label{fig:Hierarchy2}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Validation of the algorithms
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Validation}

In the following, we describe how the algorithms presented in the previous section are implemented and tested against existing codes on specific test problems.

\subsection{Implementation details}

The algorithms described above are all implemented as part of \swift (\underline{S}PH \underline{W}ith \underline{I}nter-dependent \underline{F}ine-grained \underline{T}asking), an open-source platform for hybrid shared/distributed-memory SPH simulations\footnote{See http://swiftsim.sourceforge.net/}. \swift is implemented in C and can be compiled with the {\tt gcc} compiler. Although SIMD-vectorized code, using the {\tt gcc} vector types and SSE/AVX intrinsics, has been implemented, it was switched off in the following to allow for a fair comparison against unvectorized codes.

The code for the task-based parallelism is implemented using standard {\tt pthread}s \cite{ref:pthreads}, and in some places, e.g.~in the time-stepper or the task-list creation, OpenMP is used. Each thread is assigned its own task queue, over which the tasks are distributed evenly in topological order of their dependencies. If a thread runs out of tasks, it can steal from another thread's queue \cite{ref:Blumofe1999}. The threads are initialized once at the start of the simulation and synchronize via a barrier between time steps.

\subsection{Simulation setup}

In order to assess their accuracy, efficiency, and scaling, the algorithms described in the previous section were tested in \swift using the following three simulation setups:
%
\begin{itemize}
  \item {\em Sod-shock} \cite{ref:Sod1978}: A cubic periodic domain containing a high-density region of 800\,000 particles with $P_i=1$ and $\rho_i=1$ in one half, and a low-density region of 200\,000 particles with $P_i=0.1795$ and $\rho_i=1/4$ in the other.
  \item {\em Sedov blast}: A cubic grid of $101\times 101\times 101$ particles at rest with $P_i=10^{-5}$ and $\rho_i=1$, but with the energy of the central 33 particles set to $u_i=100/33/m_i$.
  \item {\em Cosmological volume}: A realistic distribution of matter in a periodic volume of the universe. The simulation consists of 1\,841\,127 particles with a mix of smoothing lengths spanning three orders of magnitude, providing a test case for neighbour finding and parallel scaling in a real-world scenario.
\end{itemize}
%
The first two cases are used both for benchmarking and for validating the correctness of the code. In all simulations, the constants $N_{ngb}=48$, $\gamma=5/3$, $C_{CFL}=0.3$, and $\alpha = 2.0$ (viscosity) were used. In \swift, cells were split if they contained more than 400 particles and more than 87.5\% of the particles had a smoothing length less than half the cell edge length. In all three test cases, results were computed using a fixed time step, updating the forces on all particles in each time step.

The simulation results were compared with Gadget-2 \cite{ref:Springel2005} in terms of speed and parallel scaling. Gadget-2 was compiled with the Intel C Compiler version 2013.0.028 using the options \texttt{\small -DUSE\_IRECV -O3 -ip -fp-model fast -ftz -no-prec-div -mcmodel=medium}. For the parallel runs, OpenMPI version 1.6.3 was used \cite{ref:Gabriel2004}.
\swift v.~0.1.0 was compiled with the GNU C Compiler version 4.7.2 using the options \texttt{\small -O3 -ffast-math -fstrict-aliasing -ftree-vectorize -funroll-loops -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -fopenmp -march=native -pthread}. Note that although the compiler switches for the SSE and AVX vector instruction sets were activated, explicit SIMD-vectorization was not used. All simulations were run on a $4\times$Intel Xeon E5-4640 32-core machine running at 2.4\,GHz with CentOS release 6.2 Linux for x86\_64.

\subsection{Results}

The scaling and efficiency of \swift for the simulations described in the previous subsection are shown in \fig{Results}. In all three cases, \swift obtains a parallel efficiency of over 75\% on 32 cores. For the Cosmological volume, it is also more than eight times faster than Gadget-2. \fig{Times} shows the total contribution of each task type, as well as the overheads for task acquisition and time integration, for the Cosmological volume simulation. Note that efficiency is lost principally in the self- and pair-interactions, which is most probably due to the reduction of the effective memory bandwidth with the increasing number of cores used. The overheads for task acquisition contribute less than 3.5\% to the total cost.

In order to assess the correctness of the simulations, average radial density, pressure, and velocity profiles were computed for the Sedov blast (see \fig{Sedov_density}) and Sod-shock (see \fig{SodShock_profiles}) simulations and compared with the analytical results given in \cite{ref:Sedov1959} and \cite{ref:Sod1978}, respectively. These results are comparable to the output produced by Gadget-2 for the same initial conditions, and sharper results could be achieved with higher resolutions and/or different parameterizations.

\begin{figure*}
\centerline{\epsfig{file=figures/SodShock_scaling.pdf,width=0.8\textwidth}}
\centerline{\epsfig{file=figures/SedovBlast_scaling.pdf,width=0.8\textwidth}}
\centerline{\epsfig{file=figures/CosmoVolume_scaling.pdf,width=0.8\textwidth}}
\caption{Parallel scaling and efficiency for the Sod-shock, Sedov blast, and Cosmological volume simulations. The numbers in the scaling plots indicate the average milliseconds per time step when running on all 32 cores.}
\label{fig:Results}
\end{figure*}

\begin{figure}
\centerline{\epsfig{file=figures/CosmoVolume_times.pdf,width=0.9\columnwidth}}
\caption{Total time for each task type, plus the overheads for task management and time integration. The values, in seconds, are summed over all processors.}
\label{fig:Times}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/SedovBlast_density.pdf,width=0.9\columnwidth}}
\caption{Average radial density profiles and analytical solutions (dotted lines) for the Sedov blast simulation at times $t=\{0.075, 0.15, 0.275\}$.}
\label{fig:Sedov_density}
\end{figure}

\begin{figure*}
\centerline{\epsfig{file=figures/SodShock_profile.pdf,width=0.8\textwidth}}
\caption{Average density, pressure, and velocity in the $x$-direction, along with the corresponding analytical solutions (dotted lines), for the Sod-shock simulation at $t=0.12$.}
\label{fig:SodShock_profiles}
\end{figure*}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Conclusions
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}

The results show that \swift is significantly more efficient than current state-of-the-art tree-based methods.
The improved performance is the result of several features:
%
\begin{itemize}
  \item Cell-based neighbour searching, which requires only constant time per particle,
  \item Sorted particle interactions, which significantly reduce the number of spurious pairwise distance computations in the neighbour search,
  \item Use of symmetry in computing the particle densities and forces, thus requiring each particle pair to be found only once,
  \item Better cache efficiency of the cell-based interactions.
\end{itemize}
%
Since no hardware-specific tricks or explicit SIMD-vectorization were used, although the latter is available in the code, these cannot account for the improved efficiency.

In addition to the increased speed, \swift also shows good scaling, in excess of 75\% parallel efficiency on 32 cores, for all three simulations. This is due to the fine-grained load balancing and the low number of synchronization points resulting from the task-based parallelism scheme.

We are currently working on extending \swift to hybrid shared/distributed-memory parallelism, and to GPUs \cite{ref:Chalk2013}. Task-based parallelism will play an important role in both of these developments: In hybrid schemes, asynchronous communication between distributed-memory nodes can be modelled as simply another set of tasks with their corresponding dependencies, thus avoiding excessive synchronization points or latencies. As has already been shown for molecular dynamics simulations, the task-based parallel scheme can also easily be extended to computations on GPUs, using the same set of tasks and dependencies. Work is also ongoing on adding different and better implementations of particle-based hydrodynamics, e.g.~for cosmological simulations \cite{ref:Schaller2013}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Acknowledgments
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Acknowledgments}

The authors would like to thank Lydia Heck of the Institute for Computational Cosmology, Durham University, for having initiated this interdisciplinary collaboration, and for providing exclusive access to the hardware on which all benchmarks were run.

% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
%\IEEEtriggeratref{8}
% The "triggered" command can be changed if desired:
%\IEEEtriggercmd{\enlargethispage{-5in}}

% references section
% NOTE: BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/

% can use a bibliography generated by BibTeX as a .bbl file
% standard IEEE bibliography style from:
% http://www.ctan.org/tex-archive/macros/latex/contrib/supported/IEEEtran/bibtex
\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
\bibliography{sph}

% that's all folks
\end{document}