%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Document class %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[final]{siamltex} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Packages and styles %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage{graphicx} \usepackage{cite} \usepackage{color} \usepackage{algorithmic} \usepackage{algorithm} \usepackage{epsfig} \usepackage{xspace} \usepackage{amsfonts} \usepackage{amsmath} \usepackage{amssymb} \usepackage{url} \usepackage{listings} \bibliographystyle{siam} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % PLoS stuff %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \topmargin 0.0cm \oddsidemargin 0.5cm \evensidemargin 0.5cm \textwidth 16cm \textheight 21cm \usepackage[labelfont=bf,labelsep=period,justification=raggedright]{caption} \makeatletter \renewcommand{\@biblabel}[1]{\quad#1.} \makeatother \date{} \pagestyle{myheadings} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Some useful commands %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\oh}[1] {\mbox{$ {\mathcal O}( #1 ) $}} \newcommand{\eqn}[1] {(\ref{eqn:#1})} \newcommand{\fig}[1] {Figure~\ref{fig:#1}} \newcommand{\tabl}[1] {Table~\ref{tab:#1}} \newcommand{\tab}[1] {Table~\ref{tab:#1}} \newcommand{\figs}[1] {Figures~\ref{fig:#1}} \newcommand{\sect}[1] {Section~\ref{sec:#1}} \newcommand{\cuda} {{\sc cuda}\xspace} \newcommand{\namd} {{\sc namd}\xspace} \newcommand{\acemd} {{\sc acemd}\xspace} \newcommand{\fenzi} {{\sc fenzi}\xspace} \newcommand{\mdcore} {{\tt mdcore}\xspace} \newcommand{\bsf}[1] {\textbf{\textsf{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Title, author and affiliations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \title{Efficient and Scalable Algorithms for Smoothed Particle Hydrodynamics on Hybrid Shared/Distributed-Memory Architectures} \author{Pedro Gonnet\thanks{School of Engineering and Computing Sciences, Durham University, Durham, United Kingdom ({\tt pedro.gonnet@durham.ac.uk}).}} \begin{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Set options for the listings package %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \lstset{% language=C, basicstyle=\small\tt, numbers=left, numberstyle=\tiny } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Title and Author %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Title must be 150 characters or less \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Abstract %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{abstract} This paper describes a new fast and implicitly parallel approach to neighbour-finding in multi-resolution Smoothed Particle Hydrodynamics (SPH) simulations. This new approach is based on hierarchical cell decompositions and sorted interactions, within a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on hybrid shared/distributed-memory parallel architectures, e.g. 
clusters of multi-cores, achieving a $40\times$ speedup over the Gadget-2 simulation code.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Metadata
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{keywords}
smoothed particle hydrodynamics, simulation, task-based parallelism, multi-cores
\end{keywords}

\begin{AMS}
15A15, 15A09, 15A23
\end{AMS}

\pagestyle{myheadings}
\thispagestyle{plain}
\markboth{P. GONNET}{EFFICIENT AND SCALABLE ALGORITHMS FOR SPH}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Introduction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

Over the past few years, due to the physical limitations on the speed of individual processor cores, computers have, instead of getting {\em faster}, been getting {\em more parallel}. This increase in parallelism comes mainly in the form of {\em multi-core} computers, i.e.~single computers which contain more than one computational core sharing a common memory bus. Systems containing up to 64 general-purpose cores are becoming commonplace, and the number of cores can be expected to continue growing exponentially, following Moore's law, much in the same way processor speeds did up until a few years ago. This development is not restricted to shared-memory parallel desktop computers, but also affects modern High-Performance Computing (HPC) infrastructure, which consists mainly of clusters of multi-cores. Indeed, over the past five years, the main factor driving the growth in cluster performance has been the use of shared-memory multi-cores, and not necessarily an increase in the total number of nodes/computers used.

For the past 15 years, the predominant paradigm for parallel computing has been distributed-memory parallelism using MPI (Message Passing Interface) \cite{ref:Snir1998}, in which large simulations are generally parallelized by means of data decompositions, i.e.~by assigning each node or core a portion of the data on which to work. The cores execute the same code in parallel, each on its own part of the problem, intermittently exchanging data with neighbouring cores. The amount of {\em computation} local to the node is proportional to the amount of data it contains, i.e.~its {\em volume}, while the amount of {\em communication} is proportional to the amount of computation spanning two or more nodes, i.e.~its {\em surface}. For very large computations over a moderate number of nodes, the cost of communication is negligible compared to the cost of computation, thus providing good parallel efficiency. However, if the number of nodes increases, or smaller problems are considered, the surface-to-volume ratio, i.e.~the ratio of communication to computation, grows, and the time spent on communication will increasingly dominate the entire simulation, resulting in a loss of scaling and parallel efficiency.

% Another problem is that current multi-core computers
% differ in one important aspect from traditional clusters or
% single-core distributed-memory
% systems: Since all cores in a single computer share the same memory
% bus and, in most cases, parts of the cache hierarchy,
% the effective memory bandwidth per core decreases with
% increasing numbers of cores,
% making aspects such as memory and cache efficiency
% increasingly important.
Assuming the individual cores do not get any faster, simulations for which the maximum degree of parallelism has already been reached will {\em never} become any faster (see the dashed line in \fig{CosmoVolume}). The surface-to-volume ratio problem also means that large systems which currently parallelize well will, if they do not continue to grow in size, eventually also cease to scale as the number of cores used for their computation increases. In order to speed up small simulations, or to continue scaling for large simulations, new approaches to how computations are parallelized need to be considered.

In order to address increasingly large or complex problems, high-performance computing software will need to be able to better exploit the aforementioned increase in shared-memory parallelism. Although parallelism and parallel codes are nothing new --- many large-scale scientific codes, e.g.~the cosmological simulation software Gadget-2, can run concurrently on several thousands of cores --- the exponential growth of parallelism, and of shared-memory parallelism in particular, provides some interesting new challenges.

In this paper, I will describe both a well-known general formulation for shared-memory parallelism, i.e.~task-based parallelism, as well as its application to Smoothed Particle Hydrodynamics simulations. The task-based approach is extended by the concept of {\em conflicts} between tasks. The specific algorithms use several ideas from other particle-based simulations. The algorithms are implemented in SWIFT, an open-source, platform-independent software for hybrid shared/distributed-memory parallel simulations, which is shown to perform and scale significantly better than the most popular freely available code in this area.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Short outline of previous algorithms, approaches
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Previous work}

In the following, I will give an overview of the underlying equations for SPH computations, and discuss how they are normally implemented in multi-resolution simulation codes.

\subsection{Smoothed Particle Hydrodynamics}

Smoothed Particle Hydrodynamics \cite{ref:Gingold1977,ref:Price2012} (SPH) uses particles to represent fluids. Each particle $p_i$ has a position $\mathbf r_i$, velocity $\mathbf v_i$, internal energy $u_i$, mass $m_i$, and a smoothing length $h_i$. The particles are used to interpolate any quantity $Q$ at any point in space as a weighted sum over the particles:
%
\begin{equation}
  Q(\mathbf r) = \sum_i m_i \frac{Q_i}{\rho_i} W( \|\mathbf r - \mathbf r_i\| , h )
  \label{eqn:interp}
\end{equation}
%
where $Q_i$ is the quantity at the $i$th particle, $h$ is the {\em smoothing length}, i.e.~the radius of the sphere within which data will be considered for the interpolation, and $W(r,h)$ is the {\em smoothing kernel} or {\em smoothing function}. Several different forms for $W(r,h)$ exist, each with its own specific benefits and drawbacks. In the following, the most common form, consisting of a piecewise cubic polynomial, will be used:
%
\begin{equation*}
  W(r,h) = \frac{8}{\pi h^3} \left\{ \begin{array}{ll}
    1 - 6\left(\frac{r}{h}\right)^2 + 6\left(\frac{r}{h}\right)^3 & 0 \leq \frac{r}{h} \leq \frac{1}{2}, \\
    2\left( 1 - \frac{r}{h} \right)^3 & \frac{1}{2} < \frac{r}{h} \leq 1, \\
    0 & \frac{r}{h} > 1.
  \end{array}\right.
\end{equation*}

The particle density $\rho_i$ used in \eqn{interp} is itself computed similarly:
%
\begin{equation}
  \rho_i = \sum_{j,~r_{ij} < h_i} m_j W(r_{ij},h_i) \label{eqn:rho}
\end{equation}
%
where $r_{ij} = \|\mathbf{r}_i-\mathbf{r}_j\|$ is the Euclidean distance between particles $p_i$ and $p_j$. In compressible simulations, the smoothing length $h_i$ of each particle is chosen such that the weighted number of neighbours
%
\begin{equation}
  N_{ngb} = \frac{4}{3}\pi h_i^3 \sum_j W( r_{ij} , h_i ) \label{eqn:nneigh}
\end{equation}
%
is kept constant to within a given range, e.g.~$\pm 1$. This can be achieved by applying a Newton iteration to solve \eqn{nneigh} for $h_i$, where the required derivative $\partial N_{ngb}/\partial h_i$ is computed alongside \eqn{rho}.

Once the densities $\rho_i$ have been computed, the time derivatives of the velocity, internal energy, and smoothing length, which require $\rho_i$, are computed as follows:
%
\begin{eqnarray}
  \frac{d\mathbf v_i}{dt} & = & -\sum_{j,~r_{ij} < \hat{h}_{ij}} m_j \left[ \frac{P_i}{\Omega_i\rho_i^2}\nabla_rW(r_{ij},h_i) + \frac{P_j}{\Omega_j\rho_j^2}\nabla_rW(r_{ij},h_j) \right], \label{eqn:dvdt} \\
  \frac{du_i}{dt} & = & \frac{P_i}{\Omega_i\rho_i^2} \sum_{j,~r_{ij} < h_i} m_j(\mathbf v_i - \mathbf v_j) \cdot \nabla_rW(r_{ij},h_i), \label{eqn:dudt}
\end{eqnarray}
%
where $\hat{h}_{ij} = \max\{h_i,h_j\}$, and the particle pressure $P_i=\rho_i u_i (\gamma-1)$ and correction term $\Omega_i=1 + \frac{h_i}{3\rho_i}\frac{\partial \rho_i}{\partial h_i}$ are computed on the fly. The polytropic index $\gamma$ is usually set to $\frac{5}{3}$.

The computations in \eqn{rho}, \eqn{dvdt}, and \eqn{dudt} involve finding all pairs of particles within range of each other. Any particle $p_j$ is {\em within range} of a particle $p_i$ if the distance between $p_i$ and $p_j$ is smaller than or equal to the smoothing length $h_i$ of $p_i$, e.g.~as is done in \eqn{rho}. Note that since smoothing lengths may vary between particles, this association is not symmetric, i.e.~$p_j$ may be in range of $p_i$, but $p_i$ not in range of $p_j$. If $r_{ij} < \max\{h_i,h_j\}$, as is required in \eqn{dvdt}, then particles $p_i$ and $p_j$ are within range {\em of each other}.

The computation thus proceeds in two distinct stages that are evaluated separately:
\begin{enumerate}
  \item {\em Density} computation: For each particle $p_i$, loop over all particles $p_j$ within range of $p_i$ and evaluate \eqn{rho}.
  \item {\em Force} computation: For each particle $p_i$, loop over all particles $p_j$ such that $p_i$ and $p_j$ are within range of each other, and evaluate \eqn{dvdt} and \eqn{dudt}.
\end{enumerate}
The identification of these interacting particle pairs, as will be shown in the following sections, incurs the main computational cost, and therefore also presents the main challenge in implementing efficient SPH simulations.

\subsection{Tree-based approaches}

In its simplest formulation, all particles in an SPH simulation have a constant smoothing length $h$. In such a setup, finding the particles in range of any other particle is similar to the problem in Molecular Dynamics simulations, in which all particles interact within a constant cutoff radius, and the approaches used in the latter, e.g.~cell-linked lists \cite{ref:Allen1989}, Verlet lists \cite{ref:Verlet1967}, or more efficient variants thereof \cite{ref:Gonnet2012,ref:Gonnet2013}, can be used. Both approaches are discussed in the context of SPH simulations in \cite{ref:Dominguez2011} and \cite{ref:Viccione2008}.
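As a purely illustrative sketch of the cell-linked list technique mentioned above, the binning of particles into cells for a constant cutoff $h$ could, in C-like code, look as follows (the function and variable names are chosen for this example only and are not taken from any of the cited codes):

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* Bin nr_parts particles with positions x[3*k..3*k+2] into cells
   of edge length >= h in a box of size dim[0] x dim[1] x dim[2],
   assuming 0 <= x < dim. head[cid] holds the index of the first
   particle in cell cid, next[k] the next particle in the same
   cell, and -1 marks the end of each list. */
void build_cell_list(const float *x, int nr_parts, const float *dim,
                     float h, int *head, int *next) {
  int k, nc[3];
  for (k = 0; k < 3; k++) nc[k] = (int)(dim[k] / h);
  for (k = 0; k < nc[0] * nc[1] * nc[2]; k++) head[k] = -1;
  for (k = 0; k < nr_parts; k++) {
    int i = (int)(x[3*k+0] / dim[0] * nc[0]);
    int j = (int)(x[3*k+1] / dim[1] * nc[1]);
    int l = (int)(x[3*k+2] / dim[2] * nc[2]);
    int cid = (i * nc[1] + j) * nc[2] + l;
    next[k] = head[cid];
    head[cid] = k;
  }
}
\end{lstlisting}
\end{minipage}\end{center}

\noindent The neighbours within $h$ of any given particle then only need to be searched for in that particle's own cell and in the 26 adjacent cells.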
The neighbour-finding problem becomes more interesting, or difficult, in SPH simulations with variable smoothing lengths, i.e.~in which each particle has its own smoothing length $h_i$, with ranges spanning up to several orders of magnitude. In such cases, e.g.~in Astrophysics simulations \cite{ref:Gingold1977}, the above-mentioned approaches cease to work efficiently. Such codes therefore usually rely on spatial {\em trees} for neighbour finding \cite{ref:Hernquist1989,ref:Springel2005,ref:Wadsley2004}, i.e.~$k$-d trees \cite{ref:Bentley1975} or octrees \cite{ref:Meagher1982} are used to decompose the simulation space. In Astrophysics in particular, spatial trees are also a somewhat natural choice as they are used to compute long-range gravitational interactions via a Barnes-Hut \cite{ref:Barnes1986} or Fast Multipole \cite{ref:Carrier1988} method. The particle interactions are then computed by traversing the list of particles and searching for their neighbours in the tree.

Using such trees, it is in principle trivial to parallelize the neighbour finding and the actual computation on shared-memory computers, e.g.~each thread walks the tree for a different particle, identifies its neighbours and computes its densities and/or the second derivatives of the physical quantities of interest for the time integration. Despite its simple and elegant formulation, the tree-based approach to neighbour-finding has three main problems:
\begin{itemize}
  \item Computational efficiency: The cost of finding all neighbours of any given particle in the tree is, on average, in \oh{\log N}, and has worst-case behavior in \oh{N^{2/3}} \cite{ref:Lee1977}, i.e.~in any case, the computational cost per particle grows with the total number of particles $N$.
  \item Cache efficiency: When searching for the neighbours of a given particle, the data of all potential neighbours, which may not be contiguous in memory, is traversed. This leads to scattered memory access patterns that may be cache-inefficient. Furthermore, this operation is performed for each particle separately, further reducing the chances of cache re-use. On shared-memory parallel architectures, this problem is of particular concern as parts of the cache hierarchy and the memory bandwidth are shared between cores, effectively reducing both in parallel computations.
  \item Symmetry: The parallel tree search cannot exploit symmetry, i.e.~a pair $p_i$ and $p_j$ will always be found twice, once when walking the tree for each particle. It would, however, be sufficient to find it once and update both particles, as most of the particle interactions are symmetric. If this is done in a shared-memory parallel setup, special care must be taken to avoid concurrency problems when two threads update the same particle's data.
\end{itemize}
These problems are all inherently linked to the use of spatial trees, and more specifically their traversal, for neighbour-finding.

\subsection{Task-based parallelism}

Arguably the most well-known paradigm for shared-memory, or multi-threaded, parallelism is OpenMP \cite{ref:Dagum1998}, in which compiler annotations are used to describe if and when specific loops or portions of the code can be executed in parallel. When such a parallel section, e.g.~a parallel loop, is encountered, the iterations of the loop are split statically or dynamically over the available threads, each executing on a single core. Once all the threads have terminated, the program continues executing in a single thread.
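As a purely illustrative example, such an annotated parallel loop over an array of particles could look as follows (the array {\tt parts}, the counter {\tt nr\_parts}, and the function {\tt compute\_density} are assumed to be defined elsewhere):

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* The loop iterations are split over the available threads; the
   implicit barrier at the end of the loop synchronizes all threads
   before the program continues serially. */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < nr_parts; i++)
  compute_density(&parts[i]);
\end{lstlisting}
\end{minipage}\end{center}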
Unfortunately, this can lead to many inefficient branch-and-bound type operations, which generally lead to low performance and bad scaling on even moderate numbers of cores (see \fig{OMPScaling}). An additional complication is that this form of shared-memory parallelism provides no implicit mechanism to avoid or handle concurrency problems, e.g.~two threads attempting to modify the same data at the same time, or data dependencies between them. Such safeguards must be implemented explicitly using either redundancy, barriers, critical sections, or atomic memory operations, which can further degrade parallel performance.

\begin{figure}
\centerline{\epsfig{file=figures/OMPScaling.pdf,width=0.7\textwidth}}
\caption{Branch-and-bound parallelism as is commonly used in OpenMP. The horizontal arrows indicate the program flow over time, and branching arrows indicate a parallel section. The dotted vertical bars are the synchronization points at the end of each such section. Parallel efficiency is lost to two factors: The grey shaded areas along the main horizontal arrow indicate parts of the program that do not execute in parallel and restrict the maximum parallel efficiency, e.g.~as described by Amdahl's law, and the red areas indicate the difference between the fastest and slowest threads in each parallel block, i.e.~the time lost to individual thread load imbalances and synchronization.}
\label{fig:OMPScaling}
\end{figure}

In order to better exploit shared-memory parallelism, a different paradigm is needed, i.e.~instead of annotating an essentially serial computation with parallel regions, it is preferable to describe the entire computation in a way that is inherently parallelizable. One such approach is {\em task-based parallelism}, in which the computation is divided into a number of computational tasks, which are then dynamically allocated to a number of processors. In order to ensure that the tasks are executed in the right order, e.g.~that data needed by one task is only used once it has been produced by another task, and that no two tasks update the same data at the same time, {\em dependencies} between tasks are specified and strictly enforced by a task scheduler. Such a set of tasks and dependencies forms a Directed Acyclic Graph (DAG), which the processors can traverse in topological order, picking up and executing tasks when they have no unresolved dependencies. Whenever a processor is done with a task, it goes back to the DAG and looks for a new one or waits until a task becomes available, until all tasks have been completed.

\fig{TasksExample} shows five tasks, A, B, C, D, and E, drawn as circles, and their dependencies, depicted as arrows. In this example, tasks B and C depend on task A, and task D depends on task B. In a task-based parallel environment, tasks A and E could be executed first, as they have no unresolved dependencies. Once task A has been executed, tasks B and C become available. Task D can only be executed once task B has completed.

\begin{figure}
\centerline{\epsfig{file=figures/TasksExample.pdf,width=0.5\textwidth}}
\caption{Five tasks, A, B, C, D, and E, with their dependencies shown as arrows, i.e.~tasks B and C both depend on task A.}
\label{fig:TasksExample}
\end{figure}

Several middlewares providing such task-based parallelism exist, e.g.~Cilk \cite{ref:Blumofe1995}, QUARK \cite{ref:QUARK}, StarPU \cite{ref:Augonnet2011}, SMP~Superscalar \cite{ref:SMPSuperscalar}, OpenMP~3.0 \cite{ref:Duran2009}, and Intel's TBB \cite{ref:Reinders2007}.
Cilk and StarPU are implemented as extensions to the C programming language, whereas SMP Superscalar and OpenMP 3.0 use so-called {\tt pragma}s to define functions or sections of code that form tasks. QUARK and Intel's TBB are implemented as compiler-independent libraries which provide functionality for creating and executing tasks.

The main advantages of using a task-based approach are:
%
\begin{itemize}
  \item The order in which the tasks are processed is completely dynamic and adapts automatically to load imbalances.
  \item If the dependencies and conflicts are specified correctly, there is no need for expensive explicit locking, synchronization, or atomic operations to deal with most concurrency problems.
  \item Since each task has exclusive access to the data it is working on, i.e.~it operates exclusively on a restricted part of the problem data, this can lead to high cache locality and efficiency.
\end{itemize}
%
Despite these advantages, task-based parallelism is only rarely used in scientific codes, with the notable exception of the PLASMA project \cite{ref:Agullo2009}, which is the driving force behind the QUARK library, and the {\tt deal.II} project \cite{ref:Bangerth2007}, which uses Intel's TBB. The NAMD \cite{ref:Phillips2005} software for Molecular Dynamics simulations uses the Charm++ \cite{ref:Kale1993} library, which provides decomposition of the computation into {\em chares} which react to {\em messages}, in a way similar to tasks. This scarcity of task-based codes is most probably due to the fact that, for most non-trivial problems, the underlying algorithms must be redesigned from the bottom up in a task-based way in order to profit from the many advantages of task-based programming.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% As of here, new approaches
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Algorithms}

In the following, I will describe both an extension to the traditional task-based parallel programming model, as well as a task-based formulation for neighbour-finding and particle interactions in SPH simulations.

\subsection{Task-based parallelism with conflicts}

I will differ from previous approaches to task-based parallelism by introducing the concept of {\em conflicts} between tasks. Conflicts occur when two tasks operate on the same data, but the order in which these operations must occur is not defined. \fig{TasksExampleConflicts} extends the example in the previous section with conflicts between tasks B and C, and tasks D and E. In a parallel setup, once task A has been completed, if one processor picks up task B, then no other processor is allowed to execute task C until task B has completed, or vice versa.

\begin{figure}
\centerline{\epsfig{file=figures/TasksExampleConflicts.pdf,width=0.5\textwidth}}
\caption{Five tasks, A, B, C, D, and E, with their dependencies shown as arrows, and their conflicts shown as dashed lines, i.e.~tasks B and C, and tasks D and E conflict with each other.}
\label{fig:TasksExampleConflicts}
\end{figure}

In previous task-based models, conflicts can be modelled by adding dependencies between conflicting tasks, yet this introduces an artificial ordering between the tasks and imposes unnecessary constraints on the task scheduler (e.g.~mutual and non-mutual interactions in \cite{ref:Ltaief2012}).
In the following, conflicts are modelled using exclusive {\em locks} on shared resources, i.e.~a task operating on potentially shared data will only be scheduled if the executing thread can obtain an exclusive lock on that data, thus preventing other tasks using said data from being scheduled concurrently. This locking mechanism will be described in more detail further on.

\subsection{Spatial decomposition}

Besides the problems described in the previous section, spatial trees also have the disadvantage that they do not lend themselves particularly well to task-based computations. Therefore, in the following, the particle interactions will be described in terms of {\em hierarchical cell lists}, and the operations thereon.

If $h_\mathsf{max} := \max_i h_i $ is the maximum smoothing length of any particle in the simulation, the simulation domain is split into rectangular cells of edge length larger than or equal to $h_\mathsf{max}$. Given such an initial decomposition, a list of cell {\em self-interactions}, which contains all non-empty cells in the grid, is generated. This list of interactions is then extended by the cell {\em pair-interactions}, i.e.~a list of all non-empty cell pairs sharing either a face, edge, or corner. For periodic domains, cell pair-interactions must also be specified for cells neighbouring each other across periodic boundaries. In this first coarse decomposition, if a particle $p_j$ is within range of a particle $p_i$, both will either be in the same cell, or in neighbouring cells, for which a cell self-interaction or a cell pair-interaction has been specified, respectively. These self- and pair-interactions therefore encode, conceptually at least, the evaluation of the interactions between all particles in the same cell, or of all interactions between particle pairs spanning a pair of cells, respectively. If the lists of self-interactions and pair-interactions are traversed, computing all the interactions within each cell and between each cell pair, respectively, then all the required particle interactions will have been computed.

In the best case, i.e.~if each cell contains only particles with a smoothing length equal to the cell edge length, then if, for any particle $p_i$, each particle $p_j$ in the same and neighbouring cells is inspected, only roughly 16\% of the $p_j$ will actually be within range of $p_i$ \cite{ref:Gonnet2007}. If the cells contain particles whose smoothing lengths are less than the cell edge length, this ratio only gets worse. It therefore makes sense to refine the cell decomposition recursively, bisecting each cell along all spatial dimensions whenever (a) the cell contains more than some minimal number of particles, and (b) the smoothing length of a reasonable fraction of the particles within the cell is less than half the cell edge length. A cell will be referred to as {\em split} if it has been divided into {\em sub-cells}.

After the cells have been split, the cell self-interaction of each cell can be split up into the self-interactions of its sub-cells and the pair-interactions between them (see \fig{SplitCell}). Likewise, the cell pair-interactions between two cells that have been split can themselves be split up into the pair-interactions of the sub-cells spanning the original pair boundary (see \fig{SplitPair}) if, and only if, all particles in both cells have a smoothing length of less than half the cell edge length. If only one cell within a cell pair-interaction has been split, then the cell pair-interaction is preserved, i.e.
the interactions between the particles in both cells are computed, yet the split cell is considered as a whole. If the cells, self-interactions, and pair-interactions are split in such a way, then if two particles are within range of each other, they will either (a) share a cell for which a cell self-interaction is defined, or (b) be located in two cells which share a cell pair-interaction. In order to identify all the particles within range of each other, it is therefore sufficient to traverse the list of self-interactions and pair-interactions, and to compute the interactions therein.

\begin{figure}
\centerline{\epsfig{file=figures/InitialDecomp.pdf,height=0.3\textwidth}}
\caption{Initial spatial decomposition: The space is divided into cells of edge length greater than or equal to the largest smoothing length in the system. All neighbours of any given particle (small red circle) within that particle's smoothing length (large red circle) are guaranteed to lie either within that particle's own cell (green) or the directly adjacent cells (orange).}
\label{fig:InitialDecomp}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/SplitCell.pdf,height=0.2\textwidth}}
\caption{Large cells can be split, and their self-interaction replaced by the self- and pair-interactions of their sub-cells.}
\label{fig:SplitCell}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/SplitPair.pdf,height=0.2\textwidth}}
\caption{If all particles in a pair of interacting cells have a smoothing length less than or equal to half of the cell edge length, both cells can be split, and their pair-interaction replaced by the pair-interactions of the neighbouring sub-cells across the interface.}
\label{fig:SplitPair}
\end{figure}

\subsection{Particle interactions}

The interactions between all particles within the same cell, i.e.~the cell's self-interaction, can be computed by means of a double {\tt for}-loop over the cell's particle array. The algorithm, in C-like pseudo code, can be written as follows:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
for (i = 0; i < count - 1; i++) {
  for (j = i + 1; j < count; j++) {
    rij = dist(parts[i], parts[j]);
    if (rij < h[i] || rij < h[j]) {
      compute interaction.
    }
  }
}
\end{lstlisting}
\end{minipage}\end{center}

\noindent where {\tt count} is the number of particles in the cell, and {\tt parts} and {\tt h} refer to arrays of those particles' positions and smoothing lengths, respectively. The interactions between all particles in a pair of cells can be computed similarly, e.g.:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
for (i = 0; i < count_i; i++) {
  for (j = 0; j < count_j; j++) {
    rij = dist(parts_i[i], parts_j[j]);
    if (rij < h_i[i] || rij < h_j[j]) {
      compute interaction.
    }
  }
}
\end{lstlisting}
\end{minipage}\end{center}

\noindent where {\tt count\_i} and {\tt count\_j} refer to the number of particles in each cell, and {\tt parts\_i} and {\tt parts\_j}, and {\tt h\_i} and {\tt h\_j}, refer to the particles of each cell and their smoothing lengths, respectively. As described in \cite{ref:Gonnet2007}, though, using this naive double {\tt for}-loop, only roughly $33.5\%$, $16.2\%$, and $3.6\%$ of all particle pairs between cells sharing a common face, edge, or corner, respectively, will be within range of each other, leading to an excessive number of spurious pairwise distance evaluations (line~3).
The sorted cell interactions described therein will be used in order to avoid this problem, yet with some minor modifications, as the original algorithm is designed for systems in which the smoothing lengths of all particles are equal: the particles in both cells are first sorted along the vector joining the centers of the two cells, then each particle $p_i$ on the left is interacted with the sorted particles $p_j$ on the right which are within $h_i$ {\em along the cell pair axis}. The same procedure is repeated for each particle $p_j$ on the right, interacting it with each particle $p_i$ on the left which is within $h_j$, {\em but not within} $h_i$, along the cell pair axis (see \fig{SortedInteractions}). The resulting algorithm, in C-like pseudo-code, can be written as follows:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
r_i = parts_i projected onto the cell pair axis
r_j = parts_j projected onto the cell pair axis
ind_i = indices of parts_i sorted w.r.t. r_i in ascending order
ind_j = indices of parts_j sorted w.r.t. r_j in ascending order
for (i = 0; i < count_i; i++) {
  for (jj = 0; jj < count_j; jj++) {
    j = ind_j[jj];
    if (r_i[i] + h_i[i] < r_j[j]) break;
    rij = dist(parts_i[i], parts_j[j]);
    if (rij < h_i[i]) {
      compute interaction.
    }
  }
}
for (j = 0; j < count_j; j++) {
  for (ii = count_i - 1; ii >= 0; ii--) {
    i = ind_i[ii];
    if (r_i[i] < r_j[j] - h_j[j]) break;
    rij = dist(parts_i[i], parts_j[j]);
    if (rij < h_j[j] && rij > h_i[i]) {
      compute interaction.
    }
  }
}
\end{lstlisting}
\end{minipage}\end{center}

\noindent where {\tt r\_i} and {\tt r\_j} contain the positions of the particles of both cells along the cell pair axis, and {\tt ind\_i} and {\tt ind\_j} contain the particle indices sorted with respect to these positions, respectively. The {\tt if}-statements in lines 8 and 18 check whether there are any particles left within range along the cell pair axis, and prematurely exit the innermost loop if this is not the case. The particles need to be traversed twice: once to identify all particles in range of the particles on the left (lines~5--14), and once to identify all particles in range of the particles on the right, but that were not identified in the first loop (lines~15--24), thus the condition in line~20.

Instead of sorting the particles every time the pairwise interactions between two cells are computed, the sorted indices along the 26 possible cell-pair axes can be pre-computed and stored for each cell. These sorted indices are, however, symmetric: e.g.~the indices computed for a cell interacting with a cell to its left along the $x$-axis are the reverse of the indices required for interacting with the cell on its right. It is therefore only necessary to sort 13 sets of indices, and to flip the cells in a cell pair-interaction around when the order required is the opposite of the order stored, as is done in \cite{ref:Gonnet2013}.

This may still seem like quite a bit of sorting, especially for the larger, higher-level cells in the simulation. If, however, a cell is split and its sub-cells have been sorted, the sorted indices of the higher-level cell can be constructed by shifting and merging the indices of its eight sub-cells (see \fig{HierarchySorting}). For cells with sub-cells, this reduces the \oh{n\log{n}} cost of sorting to \oh{n} for merging.
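As an illustration, the merge of two already-sorted sets of sub-cell indices could, in C-like code, be sketched as follows (the function and variable names are illustrative only and do not correspond to the actual implementation); the indices of all eight sub-cells can then be combined by repeated pairwise merges:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* Merge two index lists ind_a and ind_b, each sorted in ascending
   order of the projected positions r[], into a single sorted list
   ind_out. The indices are assumed to have already been shifted
   such that they index into the parent cell's particle array. */
void merge_sorted_indices(const int *ind_a, int count_a,
                          const int *ind_b, int count_b,
                          const float *r, int *ind_out) {
  int i = 0, j = 0, k = 0;
  while (i < count_a && j < count_b)
    if (r[ind_a[i]] <= r[ind_b[j]])
      ind_out[k++] = ind_a[i++];
    else
      ind_out[k++] = ind_b[j++];
  while (i < count_a) ind_out[k++] = ind_a[i++];
  while (j < count_b) ind_out[k++] = ind_b[j++];
}
\end{lstlisting}
\end{minipage}\end{center}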
Note that many codes for Molecular Dynamics and fixed smoothing length SPH simulations compute the particle interactions in two distinct phases: first the neighbours of each particle are identified and stored in a Verlet list (or neighbour list) \cite{ref:Verlet1967}, and in a second phase the Verlet list is traversed and the actual interactions are computed. The sorted interactions described herein, however, have been shown \cite{ref:Fomin2011} to be more cache-efficient than using Verlet lists, and storing the list of particle neighbours would result in a large memory burden, which is why the latter are not used here.

\begin{figure}
\centerline{\epsfig{file=figures/SortedInteractions.pdf,width=0.7\textwidth}}
\caption{Sorted cell pair-interactions. ({\bsf A}) Starting from a pair of neighbouring cells, ({\bsf B}) the particles from both cells are projected onto the axis joining the centers of the two cells. The particles on the left (blue) and right (orange) are then sorted in descending and ascending order respectively. Each particle on the left is then only interacted with the particles on the right within the cutoff radius along the cell axis.}
\label{fig:SortedInteractions}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/HierarchySorting.pdf,width=0.7\textwidth}}
\caption{Hierarchical cell sorting of ({\bsf A}) a split cell. ({\bsf B}) The sub-cells are first sorted individually and ({\bsf C}) shifted and merged to produce the sorted list of the parent cell.}
\label{fig:HierarchySorting}
\end{figure}

\subsection{Task-based implementation}

The particle interactions described in the previous subsection lead to three basic task types:
%
\begin{itemize}
  \item Cell {\em sorting}, in which the particles in a given cell are sorted with respect to their position along the 13 cardinal axes, as described in \cite{ref:Gonnet2013}.
  \item Cell {\em self-interaction}, in which all the particles of a given cell are interacted with all the other particles within the same cell.
  \item Cell {\em pair-interaction}, in which the interactions for all particle pairs spanning a pair of cells are computed.
\end{itemize}
%
In order to reduce the total number of tasks, as well as to increase their locality, pair- and self-interactions involving fewer than a certain number of particles are split only implicitly, i.e.~the tasks resulting from a split are grouped together and executed as a single task. The self-interaction and pair-interaction tasks exist in two flavors, one for the density computation (see \eqn{rho}) and one for the force computation (see \eqn{dvdt}). Each pair-interaction task requires the sorted indices of the particles in each cell provided by the sorting tasks. Since the tasks are restricted to operating on the data of a single cell, or pair of cells, two tasks conflict if they operate on overlapping sets of cells. Due to the hierarchical nature of the spatial decomposition, two tasks also conflict if any of the cells used by one task are sub-cells of any of the cells used by the other task.

The interactions have two phases, the density and the force computation, which need to be clearly separated, i.e.~all the density tasks on a cell must complete before its force tasks, which rely on the densities, can be computed. They are therefore separated by a {\em ghost} task for each cell. This ghost task depends on all the density computations for a given cell, and, in turn, all force computations involving that cell depend on its ghost task.
This mechanism ensures that all density computations for a set of particles have completed before these densities are used in any force computation. Finally, an integrator task for each cell is used to update the particle positions, velocities, and internal energies once the forces on the particles therein have been evaluated.

The dependencies and conflicts between the different task types, which are illustrated in \fig{Hierarchy2}, can be formulated as follows:
\begin{itemize}
  \item Each cell sorting task on a cell with sub-cells depends on the sorting tasks of all its sub-cells.
  \item Each cell pair-interaction task depends on the cell sorting tasks of both its cells.
  \item Cell pair-interaction and cell self-interaction tasks operating on overlapping sets of cells or sub-cells conflict with each other.
  \item The ghost task of each cell depends on all the density cell pair-interactions and self-interactions which involve the particles in that cell.
  \item The ghost task of each cell depends on the ghost tasks of its sub-cells.
  \item Each force cell pair-interaction or self-interaction task depends on the ghost tasks of the cells on which it operates.
  \item Each integrator task depends on the force cell pair-interaction and self-interaction tasks of its cell and of all its cell's sub-cells.
\end{itemize}

This task decomposition has significant advantages over the use of spatial trees. First of all, the cost of identifying all particles in range of a given particle does not depend on the total number of particles, but only on the local particle density. Furthermore, the particle interactions in each task are computed symmetrically, i.e.~each particle pair is identified only once for each interaction type. The sorted particle indices can be re-used for both the density and force computation, and even over several time-steps \cite{ref:Gonnet2013}, thus reducing the computational cost even further. Finally, if the particles are stored grouped by cell, each task then only involves accessing and modifying a limited and contiguous region of memory, thus greatly improving cache re-use \cite{ref:Fomin2011}.

\begin{figure}
\centerline{\epsfig{file=figures/Hierarchy2.pdf,height=0.5\textwidth}}
\caption{Task dependencies and conflicts: Arrows indicate the dependencies between different task types, i.e.~an arrow from task A to task B indicates that A depends on B. Dashed lines between tasks indicate conflicts, i.e.~the two tasks cannot be executed concurrently. Each sort task (circles) depends only on the sort tasks of its sub-cells. The pair-interactions (rectangles) for the particle density computation depend on the sort tasks of the respective cells, whereas self-interaction tasks (squares) for the density computation do not, as they do not require sorting. Self- and pair-interactions on overlapping cells (same color) conflict with each other. The ghost task of each cell (triangles) depends on the self- and pair-interaction density tasks. The self- and pair-interaction tasks for the force computation depend on the ghost tasks of the respective cells, and the integrator task (inverted triangle) of each cell depends on the cell's force tasks.}
\label{fig:Hierarchy2}
\end{figure}

\subsubsection{Task scheduling}

The assignment of individual tasks to the processors of a system is a tricky issue: each task must be scheduled only once all its dependencies have been met, only if it has no conflicts, and only to a single processor.
Additionally, the tasks should be scheduled in a way that maximizes both the number of non-conflicting tasks available to all processors and the amount of cache re-use between tasks.

In the current implementation, tasks are represented as follows:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
struct task {
  int type;
  int wait;
  int nr_unlocks;
  struct task *unlocks;
  /* ... */
};
\end{lstlisting}
\end{minipage}\end{center}

\noindent where {\tt type} determines the task type, e.g.~sorting, pair, or ghost tasks. The variable {\tt wait} is a counter for the number of unresolved dependencies belonging to this task and will be zero when this task is ready to run. Conversely, {\tt unlocks} is an array of {\tt nr\_unlocks} pointers to tasks that depend on this task. The tasks also contain additional data specific to each task type, i.e.~the cells on which it operates.

The {\tt wait} counters, which are initialized to zero, are set before the computation by traversing the list of tasks, and, for each task, incrementing by one the {\tt wait} counter of each of its {\tt unlocks} tasks. The {\tt wait} counter then holds the number of other tasks on which the task depends. Once a task has been executed, the {\tt wait} counters of its {\tt unlocks} tasks are decremented by one. If a task's {\tt wait} counter is zero, then it has no unsatisfied dependencies and can be executed. As opposed to the dependencies, the task conflicts are not encoded explicitly, but implemented via locks on shared resources, i.e.~the cells on which a task operates. This locking is described in detail in \sect{locking}.

The tasks are managed by a {\em scheduler}, which assigns tasks whose dependencies have all been met to different {\em queues}. The processors, or threads, in a system then obtain the tasks directly from this scheduler, which first tries to retrieve a task from the thread's preferred queue. If this queue is empty or has no conflict-free tasks, the scheduler attempts to obtain a task from any other non-empty queue. This is, in essence, the well-known concept of {\em work-stealing} described in \cite{ref:Blumofe1999}. In order to preserve memory locality and improve cache re-use, the scheduler attempts to assign tasks working on similar sets of cells to the same queue. This is done by assigning each cell a pre-defined preferred queue. Tasks involving only cells of a given queue are assigned to that queue, and tasks involving cells from more than one queue are assigned to the shortest of those queues.

While the scheduler is responsible for dependencies and data locality, the queues themselves are responsible for conflict avoidance and task order. The queues are implemented as binary heaps which order the tasks according to their {\em weight}. A task's weight is defined as the approximate or measured computational cost of the task, plus the maximum weight of all its dependent tasks. This weight is a measure of the length of the critical execution path starting at the given task. Picking the task with the largest weight corresponds to reducing the longest critical path of the task DAG first. When a processor requests a task from a queue, the heap is traversed in topological order\footnote{This is implemented by storing the heap nodes in an array such that the $k$th entry has sub-nodes at the indices $2k+1$ and $2k+2$, and traversing this array from left to right.}, starting from the top, looking for tasks free of conflicts.
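As an illustration, the traversal can be sketched as follows in C-like code ({\tt queue}, {\tt queue\_remove}, and {\tt task\_lock\_cells} are placeholder names for this sketch and do not correspond to the actual interface; {\tt task\_lock\_cells} stands for the cell locking described in \sect{locking} and is assumed to return zero on success):

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* Scan the heap array from left to right, i.e. in roughly
   decreasing weight order, and return the first task whose
   cells can be locked. */
struct task *queue_gettask(struct queue *q) {
  int k;
  for (k = 0; k < q->count; k++) {
    struct task *t = q->tasks[k];
    if (task_lock_cells(t) == 0) {
      queue_remove(q, k); /* remove the k-th entry, restore heap. */
      return t;
    }
  }
  return NULL; /* no conflict-free task is currently available. */
}
\end{lstlisting}
\end{minipage}\end{center}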
Although the first task inspected will have the largest weight in the queue, the traversal is not in strictly decreasing weight order. This sub-optimal traversal was chosen as an efficiency trade-off, since the cost of inserting or removing an element in a binary heap is in \oh{\log n}, whereas maintaining a strict ordering requires at least \oh{n} for insertion, e.g.~using linked lists. In order to ensure that each task is assigned only once, mutexes are used to control exclusive access to each queue. The details of the task scheduler are further described in a separate publication \cite{ref:Gonnet2013b}.

\subsubsection{Cell locking}
\label{sec:locking}

Particles within a cell are also within that cell's hierarchical parents. Therefore, when working on the particles of a cell, tasks which operate on its parents' data should not be allowed to execute concurrently. One way to avoid this problem is to require that a task not only lock a cell, but also all of its hierarchical parents in order to operate on its data. This, however, would prevent tasks involving siblings which share a common hierarchical parent cell, yet whose particle sets do not overlap, from executing concurrently.

This problem is avoided by giving each cell both a {\em lock}, and a {\em hold} counter: A cell is {\em locked} when it, or one of its parent cells, is currently in use. A cell is {\em held} when one or more of its sub-cells is locked, and thus cannot be locked itself. Since more than one task at a time may hold a cell, this property is implemented as a counter. The cell locking/holding is implemented as follows:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
int cell_locktree(struct cell *c) {
  struct cell *c1, *c2;
  if (trylock(c->lock) != 0) return 1;
  if (c->hold > 0) {
    unlock(c->lock);
    return 1;
  }
  for (c1 = c->parent; c1 != NULL; c1 = c1->parent) {
    if (trylock(c1->lock) != 0) break;
    atomic_add(c1->hold, 1);
    unlock(c1->lock);
  }
  if (c1 != NULL) {
    for (c2 = c->parent; c2 != c1; c2 = c2->parent) {
      atomic_sub(c2->hold, 1);
    }
    unlock(c->lock);
    return 1;
  } else {
    return 0;
  }
}
\end{lstlisting}
\end{minipage}\end{center}

\noindent When trying to lock a cell, the function first checks that it is neither locked (line 3) nor held (line 4), i.e.~that its hold counter is zero. If neither is the case, then the cell itself can be locked. The function then travels up the hierarchy, increasing the hold counter of each cell on the way, up to the topmost cell (lines 8--12). If any cell along the hierarchy is locked (line 9), the locking is aborted and all locks and holds are undone (lines 13--18, see \fig{CellLocking}). The operations {\tt atomic\_add} and {\tt atomic\_sub} are understood, respectively, to increase or decrease a value atomically.

When the cell is released, its lock is unlocked and the hold counters of all its hierarchical parents are decreased by one:

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
void cell_unlocktree(struct cell *c) {
  struct cell *c1;
  unlock(c->lock);
  for (c1 = c->parent; c1 != NULL; c1 = c1->parent)
    atomic_sub(c1->hold, 1);
}
\end{lstlisting}
\end{minipage}\end{center}

\begin{figure}
\centerline{\epsfig{file=figures/CellLocking.pdf,width=0.5\textwidth}}
\caption{Example of hierarchical cell locking. The cells marked in red are ``locked'' while the cells marked in yellow have a ``hold'' count larger than zero. The hold count is shown inside each cell and corresponds to the number of locked cells hierarchically below it.
All cells except for those locked, below a locked cell, or with a ``hold'' count larger than zero can still be locked without violating any constraints.}
\label{fig:CellLocking}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Extension of the algorithms to hybrid parallelism
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Hybrid shared/distributed-memory parallelism}

Although the task-based algorithms in the previous subsections were described only in the context of shared-memory parallelism, the task-based scheme extends rather elegantly to {\em hybrid shared/distributed-memory parallel} setups as well. The highest-level cells in the hierarchical cell lists can be distributed between a set of distributed-memory {\em nodes}, such that for each node the space is partitioned into {\em local} and {\em foreign} cells. On each node, the tasks are constructed as in the single-node case, yet each node only keeps the tasks which involve at least one local cell, i.e.~self-interaction, ghost, and integrator tasks on local cells, and pair-interaction tasks involving two local cells, or a local and a foreign cell. This step would seem to imply that each node would have to have information on all particles in the system, but it is actually sufficient for it to know only the structure of the cell hierarchy of the highest-level foreign cells adjacent to a local cell.

The set of foreign cells which are involved in a pair-interaction with a local cell will be referred to as {\em proxy} cells. These cells are used to compute interactions locally, but contain particles which reside on a different node. For each proxy cell, two {\em communication tasks} are generated, one to receive the particle positions and other data for the density computation, and one to receive the particle densities and other data for the force computation. The density and force tasks involving each proxy cell are made dependent on these two communication tasks, respectively. If a proxy cell requires a sort task, the sort task is also made dependent on the communication task for the particle positions (see \fig{Hierarchy3}).

Similarly, for each local cell which is a proxy cell on another node, two communication tasks are generated to send the particle positions and densities, respectively, to the foreign node. The communication task sending a cell's particle positions has no dependencies, but care must be taken to make any task that will update the particle data, i.e.~the cell's ghost task, depend on its completion to avoid corrupting the particle data before it is sent. The communication task sending a cell's particle densities depends on the computation of said densities, i.e.~the cell's ghost task. Since the cell's integrator task will modify the particle data, it must be made dependent on the completion of the communication task sending the cell's particle densities.

All communication is implemented asynchronously using the {\tt MPI\_Isend}, {\tt MPI\_Irecv}, and {\tt MPI\_Test} commands. The {\tt MPI\_Isend} and {\tt MPI\_Irecv} commands for a communication task are emitted as soon as the task's dependencies have been met and it is enqueued. Once in the task queue, the communication tasks are only executed once a call to {\tt MPI\_Test} indicates that they have indeed completed.
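As an illustration, the handling of a receive-type communication task could be sketched as follows (the {\tt struct task} fields {\tt buff}, {\tt size}, {\tt from}, {\tt tag}, and {\tt req} used here are assumed for the purpose of this sketch and do not necessarily correspond to the actual implementation):

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* Post the non-blocking receive for a proxy cell's particle data
   as soon as the communication task is enqueued... */
void comm_task_enqueue(struct task *t) {
  MPI_Irecv(t->buff, t->size, MPI_BYTE, t->from, t->tag,
            MPI_COMM_WORLD, &t->req);
}

/* ...and only consider the task runnable once MPI_Test reports
   that the underlying transfer has completed. */
int comm_task_ready(struct task *t) {
  int flag;
  MPI_Test(&t->req, &flag, MPI_STATUS_IGNORE);
  return flag;
}
\end{lstlisting}
\end{minipage}\end{center}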
The tasks themselves do nothing else, i.e.~all the work happens in the background or on the calls to {\tt MPI\_Test}, depending on how asynchronous communication is implemented in the underlying MPI library. Contrary to most distributed-memory parallel codes, the communication between two nodes is not grouped into a single MPI {\tt send}/{\tt recv} command. Using individual communication tasks has the advantage that the particle data, which is stored contiguously for each cell, does not need to be re-packaged for communication. It also has the advantage of preventing bottlenecks, since a single monolithic sending task would have to wait for {\em all} density tasks involving a different node to complete before it could be executed. The only potential disadvantage is that the sum of the latencies for several small sends and receives is far larger than the latency for a single large send and receive. This, however, should be of no particular concern as the task-based model can simply execute other tasks while waiting for the data to be transmitted, effectively masking any communication latencies.

The main advantage of using non-blocking communication primitives and a task-based model over the traditional synchronous communication in other distributed-memory parallel codes is twofold: Firstly, since the computation is data-driven, there are no fixed synchronization points between nodes, which could force all the nodes to wait until the slowest node is ready. Secondly, the communication latencies are completely hidden, i.e.~instead of waiting idly for data to be transmitted, the cores of a node can execute any other task, if available, in the meantime.

One final advantage associated with using the task-based formulation for hybrid simulations has to do with the domain decomposition, which has so far simply been assumed to be given. In most parallel codes, the domain is decomposed following the simulation {\em data}, i.e.~attempting to spread the particles evenly amongst the nodes. The tasks, however, form a complete representation of the {\em work} that will actually be done during the computation, which can be modelled as a graph in which each cell is a vertex, and two vertices share an edge if the corresponding cells are spanned by a pair-interaction task. The vertex weights are given by the sum of the approximate costs of all tasks involving only that cell, and the edge weights by the sum of the approximate costs of the tasks spanning both cells. This graph can then be partitioned using a standard library such as METIS \cite{ref:Karypis1998}, providing a good equidistribution of the actual {\em work} across the nodes, as opposed to just the {\em data}.

\fig{TaskPlot} shows the task timeline for a single time-step of a hybrid shared/distributed-memory parallel simulation of 1\,M particles on a perturbed grid on eight 12-core nodes of the COSMA4\footnote{\url{http://icc.dur.ac.uk/index.php?content=Computing/Cosma}} cluster. The different colors indicate the different task types. The white gaps preceding the red communication tasks are time spent in the {\tt MPI\_Test} function. Note that despite the large degree of parallelization (96 cores, $<90$\,ms per time step), the load distribution between the nodes is good and the load balance within each node is almost perfect, despite the uneven communication costs.

It should be noted that the hybrid approach will still suffer from the same surface-to-volume ratio problem for distributed computations described in the introduction.
However, as opposed to purely distributed-memory parallel codes, the problem is somewhat mitigated by the fact that the hybrid approach uses one MPI process per {\em physical node}, as opposed to one MPI process per {\em core}, and that the communication latencies can be overlapped with computations over local cells. With the increasing number of cores per cluster node, this significantly reduces the ratio of communication to computation.

\begin{figure}
\centerline{\epsfig{file=figures/Hierarchy3.pdf,height=0.5\textwidth}}
\caption{Task dependencies and conflicts for hybrid parallelism: Two task hierarchies on neighbouring nodes. The blue cell, which resides on the left node, interacts with the yellow cell, which resides on the right node. The regular task hierarchy of the ``ghost'' yellow cell on the left is replaced with communication tasks (yellow diamonds) and new dependencies (red arrows) on both nodes. On the left node, the sorting and density tasks for the yellow cell depend on the receipt of the particle positions (marked ``{\bf x}''). Similarly, the force tasks depend on the receipt of the particle densities (marked ``$\boldsymbol\rho$''). On the right node, the sending of the particle densities depends on the cell's ghost task to ensure that they have actually been computed. The ghost and integrator tasks on the right, which overwrite parts of the particle data, are made to depend on the communication tasks having completed.}
\label{fig:Hierarchy3}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/PerturbedBox_tasks.pdf,width=0.9\textwidth}}
\caption{Task timeline for a single time-step of a hybrid shared/distributed-memory parallel simulation on eight 12-core nodes of the COSMA4 cluster. Each task is represented by a block, the color of which represents the task type, i.e.~pair-interaction (blue), self-interaction (green), ghost (yellow), communication (red), and integration (orange).}
\label{fig:TaskPlot}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Validation of the algorithms
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Validation}

This section describes how the algorithms shown in the previous section are implemented and tested against existing codes on specific problems.

\subsection{Implementation details}

The algorithms described above are all implemented as part of SWIFT (\underline{S}PH \underline{W}ith \underline{I}nter-dependent \underline{F}ine-grained \underline{T}asking), an open-source platform for hybrid shared/distributed-memory SPH simulations\footnote{See \url{http://swiftsim.sourceforge.net/}.}. The code is being developed in collaboration with the Institute of Computational Cosmology (ICC) at Durham University. SWIFT is implemented in C, and can be compiled with the {\tt gcc} compiler. Although explicitly SIMD-vectorized code, using the {\tt gcc} vector types and SSE/AVX intrinsics, has been implemented, it was switched off in the following to allow for a fair comparison with Gadget-2, which does not use explicit vectorization.

The underlying multithreading for the task-based parallelism is implemented using standard {\tt pthread}s \cite{ref:pthreads}. Each thread is assigned its own task queue. The threads executing the tasks are initialized once at the start of the simulation and synchronize via a barrier between time steps.
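As an illustration, the main loop of each such thread could be sketched as follows ({\tt scheduler\_gettask}, {\tt task\_execute}, and {\tt task\_done} are placeholder names for this sketch rather than the actual interface):

\begin{center}\begin{minipage}{0.8\textwidth}
\begin{lstlisting}
/* Each worker thread repeatedly asks the scheduler for a
   conflict-free task, executes it, and then releases its
   dependencies and cell locks, until no tasks are left in the
   current time-step. */
void *runner_main(void *data) {
  struct scheduler *s = (struct scheduler *)data;
  struct task *t;
  while ((t = scheduler_gettask(s)) != NULL) {
    task_execute(t); /* sort, density, force, ghost, etc. */
    task_done(t);    /* decrement the wait counters of dependent
                        tasks and unlock the cells held by it. */
  }
  return NULL;
}
\end{lstlisting}
\end{minipage}\end{center}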
The task selection is implemented as described above, with the addition that the ID of the last thread to have worked on each cell is recorded, and that the queue inspects up to 50 valid tasks in search of one involving previously-used cells.

The equations of motion are integrated using a velocity-Verlet integrator.
Multiple time-stepping has been implemented similarly to the Gadget-2 code \cite{ref:Springel2005}: The maximum time-step for each particle is computed as
%
\begin{equation}
  \Delta t_i = C_{CFL}\frac{2 h_i}{ \max_j\left( c_i + c_j + \max\{0,-3 \mathbf r_{ij} \cdot \mathbf v_{ij} / r_{ij} \} \right) }
  \label{eqn:dt}
\end{equation}
%
where $\mathbf v_{ij} = \mathbf v_i - \mathbf v_j$, $\mathbf r_{ij} = \mathbf r_i - \mathbf r_j$, and $c_i$ and $c_j$ are the speeds of sound of the interacting particles.
The CFL constant $C_{CFL}$ is usually set to $\sim 0.3$.
Given a base time-step $\Delta t$, particles for which $2^k\Delta t \leq \Delta t_i < 2^{k+1}\Delta t$ are only {\em active}, i.e.~included in the density and force calculations, every $2^k$th step.
Tasks which do not involve any cell with active particles, and are not dependencies of any tasks with active particles, are omitted from the task list in each step.

SWIFT also implements the pseudo-Verlet lists described in \cite{ref:Gonnet2013}: The spatial decomposition is computed once and then used over several time-steps until it is invalidated by particle movement.
The sorted particle indices for each cell are computed and stored whenever the cells are updated, and re-used over subsequent time-steps.
If the cell decomposition does not need to be re-computed often, this can lead to substantial savings by eliminating the sort tasks in these time-steps.

Finally, SWIFT also implements an artificial viscosity of the Monaghan--Balsara type \cite{ref:Monaghan1983,ref:Balsara1995}, i.e.~the terms
%
\begin{eqnarray*}
  \frac{d\mathbf{v}_i}{dt} & = & -\frac{1}{4}\sum_{r_{ij} < \hat{h}_{ij}} m_j \Pi_{ij} \left({\nabla_r}W({r}_{ij}, h_i)+{\nabla_r}W({r}_{ij}, h_j)\right) (f_i+f_j),\\
  \frac{du_i}{dt} & = & \frac{1}{8} \sum_{r_{ij} < \hat{h}_{ij}} m_j \Pi_{ij}(\mathbf{v}_i - \mathbf{v}_j) \cdot \left({\nabla_r}W({r}_{ij}, h_i)+{\nabla_r}W({r}_{ij}, h_j)\right) (f_i+f_j),
\end{eqnarray*}
%
are added to \eqn{dvdt} and \eqn{dudt} respectively, where
%
\begin{eqnarray*}
  \Pi_{ij} &=& -\alpha \frac{\left(c_i + c_j - 3w_{ij}\right)w_{ij}}{\rho_i + \rho_j}, \\
  w_{ij} &=& \min\left\{0, \mathbf{v}_{ij}\cdot\mathbf{r}_{ij} / r_{ij}\right\}, \\
  f_i &=& \frac{|\nabla \cdot \mathbf{v}_i|}{|\nabla \cdot \mathbf{v}_i| + |\nabla \times \mathbf{v}_i| + 10^{-4}\frac{c_i}{h_i}}, \\
  \nabla \times \mathbf{v}_i &=& -\frac{1}{\rho_i}\sum_j m_j (\mathbf{v}_j - \mathbf{v}_i)\times {\nabla_r}W({r}_{ij}, h_i), \\
  \nabla \cdot \mathbf{v}_i &=& \frac{1}{\rho_i}\sum_j m_j (\mathbf{v}_j - \mathbf{v}_i)\cdot {\nabla_r}W({r}_{ij}, h_i).
\end{eqnarray*}
%
and the viscosity parameter $\alpha$ is usually chosen in the range $[0.5,2]$.

\subsection{Simulation setup}

In order to test their accuracy, efficiency, and scaling, the algorithms described in the previous section were tested in SWIFT using the following three simulation setups:
%
\begin{itemize}
% \item {\em Perturbed box}: A periodic cubic box containing
%   $50\times 50\times 50$ particles on a regular grid
%   initialized with $\rho_i=1$,
%   $P_i=1$ and initial velocities $\mathbf v_i=\mathbf 0$.
%   All particles have the same initial smoothing lengths $h_i$,
%   which vary only marginally during the simulation.
%   The initial particle positions are slightly perturbed
%   and the system relaxes to a glass-like configuration.
%   This simulation provides a test case for energy
%   conservation, and a benchmark for neighbour finding with
%   constant smoothing lengths.
\item {\em Sod-shock} \cite{ref:Sod1978}: A rectangular periodic domain of size $8\times 1 \times 1$ containing a high-density region of 800\,000 particles with $P_i=1$ and $\rho_i=4$ on one half, and a low-density region of 200\,000 particles with $P_i=0.1795$ and $\rho_i=1$ on the other.
  The simulation results can be compared to an analytic solution, providing a test case for the accuracy of the implementation.
\item {\em Sedov blast}: A face-centered cubic lattice of $101\times 101\times 101$ particles at rest with $P_i=\rho_i=1$, yet with the central 26 particles set to $P_i=100$.
  The resulting blast wave provides a good example of strong pressure, density, and smoothing-length gradients for which an analytical solution can be computed.
\item {\em Cosmological box}: A realistic distribution of matter in a periodic volume of the universe at redshift $z=0.5$.
  The simulation consists of $\sim$51\,M particles with a mix of smoothing lengths spanning three orders of magnitude, providing a test case for neighbour finding and parallel scaling in a real-world scenario.
  Although cosmological simulations are often run with billions of particles \cite{ref:Springel2005b}, the number of particles used is sufficient for the study of a number of interesting phenomena.
\end{itemize}
%
The first two simulations, the Sod-shock and Sedov blast, are idealized problems used mainly to show the correctness of the code, whereas the Cosmological box represents a real-world use case.

In all simulations, the constants $N_{ngb}=48$, $\gamma=5/3$, $C_{CFL}=1/4$, and $\alpha = 0.8$ were used.
In SWIFT, cells were split if they contained more than 300 particles and more than 87.5\% of the particles had a smoothing length less than half the cell edge length.
For cells or cell pairs containing fewer than 6000 particles, the tasks hierarchically below them were grouped into a single task.
The Sod-shock and Sedov blast simulations were run both with and without the sorted particle interactions in order to provide a rough comparison to traditional neighbour-finding approaches in SPH simulations with locally constant smoothing lengths.
In all three test cases, results were computed using a fixed time-step and updating the forces on all particles in each step.
In the Sedov blast simulation, the time-step was set to be below the smallest particle time-step as computed in \eqn{dt} in each step.

The simulation results were compared with Gadget-2 \cite{ref:Springel2005} in terms of speed and parallel scaling.
Although the Gadget-2 software was originally designed for gravity-only simulations, it is arguably the most efficient and most widely-used software for SPH simulations in astrophysics.
Gadget-2 was compiled with the Intel C Compiler version 2013.0.028 using the options
\begin{quote}{\tt -DUSE\_IRECV -O3 -ip -fp-model fast -ftz -no-prec-div -mcmodel=medium}\end{quote}
and was run with the gravitational interactions switched off.
SWIFT v.~1.0.0 was compiled with the GNU C Compiler version 4.8 using the options
\begin{quote}{\tt -O3 -ffast-math -fstrict-aliasing -ftree-vectorize -funroll-loops -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -fopenmp -march=native -pthread}.\end{quote}
Note that although the compiler switches for the SSE and AVX vector instruction sets were activated, explicit SIMD-vectorization, using vector types and/or intrinsics, was not enabled.
For the hybrid shared/distributed-memory parallel runs, Platform~MPI version 9.1.0 was used.

All simulations were run on the COSMA5 supercomputer,\footnote{\url{http://icc.dur.ac.uk/index.php?content=Computing/Cosma}} consisting of 420 quad-Intel Xeon E5-2670 16-core nodes running at 2.6\,GHz with CentOS release 6.2 Linux for x86\_64.
The nodes are connected via Mellanox FDR10 Infiniband in a 2:1 blocking configuration.

\subsection{Results}

\begin{figure}
\centerline{\epsfig{file=figures/SodShock_profile.pdf,width=0.9\textwidth}}%
% \centerline{\epsfig{file=SodShock_scaling.pdf,width=0.9\textwidth}}
\caption{Results for the Sod-shock simulation test case at $t=0.12$.
  The density, pressure, and velocity profiles (solid lines) are in good
  agreement with the analytic solution (dotted lines).}
\label{fig:SodShock}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/SedovBlast_density.pdf,width=0.8\textwidth}}
\caption{Radial density profile of the Sedov blast simulations at times
  $t=0.075$ (crosses), $t=0.150$ (circles), and $t=0.275$ (triangles), along
  with the corresponding analytically computed solution (dotted lines).}
\label{fig:SedovBlast}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/SodShock_scaling.pdf,width=0.8\textwidth}}
\centerline{\epsfig{file=figures/SedovBlast_scaling.pdf,width=0.8\textwidth}}
\caption{Parallel (strong) scaling and efficiency plots for the Sod-shock and
  Sedov blast simulations on up to 16 cores of a $4\times$ Intel Xeon E5-2670
  at 2.6\,GHz.
  The numbers inside the speedup plots on the left are the average number of
  milliseconds per time-step.}
\label{fig:Scaling}
\end{figure}

\begin{figure}
\centerline{\epsfig{file=figures/BigCosmoVolume_cosma5.pdf,width=0.8\textwidth}}
\caption{Hybrid parallel (strong) scaling and efficiency for the Cosmological
  volume simulation on up to 64 nodes of the COSMA5 cluster, containing 16
  cores each, for both SWIFT and Gadget-2.
  The numbers in the scaling plot on the left represent the minimum average
  milliseconds per time-step achieved.
  The inverted triangles and numbers in the efficiency plot indicate the
  number of cores used at each data point.
  For both codes, the parallel efficiency is computed relative to the
  performance of each code on a single core.}
\label{fig:CosmoVolume}
\end{figure}

\fig{SodShock} shows the averaged density, pressure, and velocity profiles along the $x$-axis of the Sod-shock simulation at time $t=0.12$.
The computed results are comparable to those produced with Gadget-2 for the same setup, and in good agreement with the analytical solution \cite{ref:Sod1978}.
Similarly, \fig{SedovBlast} shows the average radial density profiles for the Sedov blast simulation at times $t=0.075$, $t=0.150$, and $t=0.275$, along with their analytical solutions \cite{ref:Sedov1959}.

\fig{Scaling} shows the performance of the shared-memory task-based code on a single 16-core node of the COSMA5 cluster.
In both cases, SWIFT achieves over 80\% parallel efficiency on all 16 cores.
The Sod-shock and Sedov blast simulations without the sorted interactions described in Section~3.3 are roughly $2$--$4\times$ slower than the simulations with sorted interactions.
It should be noted that these results are for {\em strong scaling} on moderately-sized problems.

The scaling plots are also interesting for what they do not show, namely any apparent NUMA-related effects, e.g.~performance jumps due to odd numbers of threads per processor.
Such effects are usually considered to be a serious problem for shared-memory parallel codes in which several cores share the same memory bus and parts of the cache hierarchy, thus limiting the effective memory bandwidth and higher-level cache capacities \cite{ref:Thacker2006}.
The hierarchical cells described herein are sufficiently cache-efficient to avoid such problems: By working on sets of particles that are stored contiguously in memory and fit well into the lower-level processor caches, the ratio of computation to memory access is kept relatively high, relieving pressure on the higher-level caches and the memory bus.

Finally, \fig{CosmoVolume} shows the results of the hybrid shared/distributed-memory parallel Cosmological volume simulation.
On a single 16-core node, SWIFT achieves 90\% parallel efficiency.
As the number of nodes is successively doubled, the parallel efficiency stays above 80\% up to 256 cores (16 nodes), and drops to 56\% at 1024 cores (64 nodes), corresponding to a speedup of 571$\times$.
Gadget-2 does not fare as well, achieving only 40\% parallel efficiency on a single 16-core node and reaching its maximum speedup of 107$\times$ at 384 cores (24 nodes) with a parallel efficiency of only 28\%.
Beyond 384 cores, the parallel efficiency decays rapidly, reaching 4\% at 1024 cores.
Over all 1024 cores, SWIFT is 39.6$\times$ faster than Gadget-2.
This speedup is due not only to better scaling, but also to better performance on a single core, where SWIFT is already 7.5$\times$ faster than Gadget-2.
Note again that all these results refer to {\em strong scaling}, and that SWIFT was run without explicit vectorization in order to provide a fair comparison.

% \begin{itemize}
%
% \item Ratio of spurious pairwise distance calculations in tree-code
%   vs.~using cells, show actual numbers for each case.
%
% \end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Conclusions
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Conclusions}

The good results presented in the previous section can be attributed to a combination of several factors:
\begin{itemize}

\item {\em Better algorithms}: The cell-based particle interaction algorithms described in Section~3 differ significantly from the widely-used tree-based neighbour-finding approaches.
  The new neighbour-finding algorithm's main advantage is its amortized cost of \oh{1} operations per particle, as opposed to \oh{N^{2/3}} (for a total of $N$ particles) for the tree search.
  The sorted interactions add another factor of $2$--$4\times$ speedup.
  These algorithmic improvements alone make SWIFT a total of 7.5$\times$ faster than Gadget-2 on a single core, as measured with the Cosmological volume simulation, i.e.~decoupled from any improvements to the parallel performance of either code.

\item {\em Better load balancing on a single node}: The main advantage of the task-based parallel approach with constraints used in SWIFT is that it provides automatic fine-grained load balancing.
  This, along with the lack of explicit locking, atomics, and/or synchronization, results in more than 80\% parallel efficiency on a single 16-core node.
  This almost linear scaling translates to a 15-fold advantage over Gadget-2 on a single 16-core node for the Cosmological volume simulation.
  The small loss of parallel efficiency is due mainly to the few remaining serial parts of the code.

\item {\em Better caching behavior}: A subtle additional advantage of the task/cell-based particle interaction algorithm is that the computation is organized in such a way that it maximizes the amount of computation per unit of memory accessed.
  Instead of interacting a single particle with all its neighbours at potentially disparate memory locations, the cell-based approach computes all interactions between two sets of contiguous particles, allowing them to be kept in the lowest-level caches for each task.
  Furthermore, since each task has exclusive access to its particles' data, there is little performance lost to cache-coherency maintenance across cores.
  This can be seen in the strong scaling up to 16 cores of the same machine, and in the complete absence of NUMA-related effects.

\item {\em Better load balancing across multiple nodes}: Instead of splitting the {\em particles} over a set of distributed-memory nodes, SWIFT uses the task graph to partition the {\em work} over the nodes, resulting in better load balancing and scaling, as shown by the Cosmological volume simulation, which has a parallel efficiency of 62\% on 64 nodes, relative to a single node.
  This is all the more impressive considering that there are fewer than 800k particles per node and that each time-step takes only half a second.

\item {\em Hybrid shared/distributed-memory parallelism}: One of the main issues with massively parallel codes is that, as the number of involved cores increases, so does the ratio of communication to computation for each core, resulting in an eventual loss of scaling.
  By taking a hybrid shared/distributed-memory parallel approach, the number of communicating nodes, and thus the total communication, is reduced by a factor of the number of cores per shared-memory node.

\item {\em Asynchronous distributed-memory communication}: The task-based computation lends itself especially well to the implementation of asynchronous communication: Each node sends data as soon as it is ready, and activates tasks which depend on communication only once the data has been received.
  This avoids any explicit synchronization between nodes to coordinate communication, and, by spreading the communication throughout the computation, reduces pressure on the communication infrastructure.
  Furthermore, since the idle time between sending and receiving data is used for computation, network latencies are much less of a limiting factor for the overall performance.

\end{itemize}

Most of these factors are a direct consequence of the task-based structure of the computation.
This simple yet powerful paradigm is at the core of the results presented herein.

It should be noted, finally, that the algorithms and performance gains described herein are not the result of exploiting any single, short-lived feature of the underlying hardware.
All results presented herein were obtained with {\em existing}, commonplace computer hardware.
The algorithms merely exploit what has been the trend in computer architecture for the past decade, i.e.~shared-memory parallel multi-core systems, shared hierarchical memory caches, and limited communication bandwidth/latency.
These trends still hold true for more modern architectures such as the Intel MIC and GPUs, as evidenced by Intel's promotion of Cilk on the former \cite{ref:Reinders2012}, and by our own work on bringing task-based parallelism to the latter \cite{ref:Chalk2014}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Acknowledgments
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section*{Acknowledgments}

The author would like to thank Matthieu Schaller and Tom Theuns of the Institute for Computational Cosmology (ICC), and Aidan Chalk of the School of Engineering and Computing Sciences (SECS), at Durham University, for the ongoing collaboration of which this work is but one of the first results.
This collaboration would never have happened were it not for Lydia Heck, also of the ICC at Durham University, who brought the group together and provided access to, and expertise on, the COSMA5 cluster.

This work used the DiRAC Data Centric system at Durham University, operated by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (\url{www.dirac.ac.uk}).
This equipment was funded by BIS National E-infrastructure capital grant ST/K00042X/1, STFC capital grant ST/H008519/1, and STFC DiRAC Operations grant ST/K003267/1 and Durham University.
DiRAC is part of the National E-Infrastructure.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliography
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\nopagebreak

\bibliography{sph}

\end{document}