Commit f17c338e authored by Matthieu Schaller

Merge branch 'pasc_paper' into 'master'

PASC paper

First batch of changes to address reviewer comments (see [the doc](https://docs.google.com/document/d/1Y4dogDkIF8V5qQao6PnewRb9vil7sSzDPj7-L5FxlZM/edit?usp=sharing) for more details).

See merge request !103
parents 3e0e0c69 f8655faa
@@ -23,14 +23,14 @@
borderopacity="1.0"
inkscape:pageopacity="0.0"
inkscape:pageshadow="2"
inkscape:zoom="2"
inkscape:zoom="0.93"
inkscape:cx="335.14507"
inkscape:cy="476.92781"
inkscape:cy="494.74535"
inkscape:document-units="px"
inkscape:current-layer="layer1"
showgrid="true"
inkscape:window-width="1920"
inkscape:window-height="1033"
inkscape:window-width="1366"
inkscape:window-height="721"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="1">
@@ -1302,5 +1302,27 @@
x="415.34085"
id="tspan7331"
sodipodi:role="line">x</tspan></text>
<text
xml:space="preserve"
style="font-size:12px;font-style:normal;font-weight:bold;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans;-inkscape-font-specification:Sans Bold"
x="239.18359"
y="364.36218"
id="text3183"
sodipodi:linespacing="125%"><tspan
sodipodi:role="line"
id="tspan3185"
x="239.18359"
y="364.36218">Node 1</tspan></text>
<text
sodipodi:linespacing="125%"
id="text3187"
y="364.36218"
x="499.18359"
style="font-size:12px;font-style:normal;font-weight:bold;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Sans;-inkscape-font-specification:Sans Bold"
xml:space="preserve"><tspan
y="364.36218"
x="499.18359"
id="tspan3189"
sodipodi:role="line">Node 2</tspan></text>
</g>
</svg>
@@ -221,16 +221,6 @@ archivePrefix = "arXiv",
organization={University of Edinburgh}
}
@article{gonnet2015efficient,
title={Efficient and Scalable Algorithms for Smoothed Particle Hydrodynamics on Hybrid Shared/Distributed-Memory Architectures},
author={Gonnet, Pedro},
journal={SIAM Journal on Scientific Computing},
volume={37},
number={1},
pages={C95--C121},
year={2015},
publisher={SIAM}
}
@article{ref:Dagum1998,
title={{OpenMP}: an industry standard {API} for shared-memory programming},
@@ -390,3 +380,16 @@ archivePrefix = "arXiv",
bibsource = {dblp computer science bibliography, http://dblp.org}
}
@inproceedings{ref:Kale1993,
author = "{Kal\'{e}}, L.V. and Krishnan, S.",
title = "{CHARM++: A Portable Concurrent Object Oriented System
Based on C++}",
editor = "Paepcke, A.",
fulleditor = "Paepcke, Andreas",
pages = "91--108",
Month = "September",
Year = "1993",
booktitle = "{Proceedings of OOPSLA'93}",
publisher = "{ACM Press}",
}
@@ -49,35 +49,42 @@ strong scaling on more than 100\,000 cores.}
\author{
\alignauthor
Matthieu~Schaller\\
\affaddr{Institute for Computational Cosmology (ICC)}\\
\affaddr{Department of Physics}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\email{\footnotesize \url{matthieu.schaller@durham.ac.uk}}
\alignauthor
Pedro~Gonnet\\
\affaddr{School of Engineering and Computing Sciences}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\alignauthor
Aidan~B.~G.~Chalk\\
\affaddr{School of Engineering and Computing Sciences}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
\and
\alignauthor
Peter~W.~Draper\\
\affaddr{Institute for Computational Cosmology (ICC)}\\
\affaddr{Department of Physics}\\
\affaddr{Durham University}\\
\affaddr{Durham DH1 3LE, UK}\\
%% \alignauthor
%% Tom Theuns\\
%% \affaddr{Institute for Computational Cosmology}\\
%% \affaddr{Department of Physics}\\
%% \affaddr{Durham University}\\
%% \affaddr{Durham DH1 3LE, UK}
Main~Author\\
\affaddr{Institute}\\
\affaddr{Department}\\
\affaddr{University}\\
\affaddr{City Postal code, Country}\\
\email{\footnotesize \url{main.author@university.country}}
% \alignauthor
% Matthieu~Schaller\\
% \affaddr{Institute for Computational Cosmology (ICC)}\\
% \affaddr{Department of Physics}\\
% \affaddr{Durham University}\\
% \affaddr{Durham DH1 3LE, UK}\\
% \email{\footnotesize \url{matthieu.schaller@durham.ac.uk}}
% \alignauthor
% Pedro~Gonnet\\
% \affaddr{School of Engineering and Computing Sciences}\\
% \affaddr{Durham University}\\
% \affaddr{Durham DH1 3LE, UK}\\
% \alignauthor
% Aidan~B.~G.~Chalk\\
% \affaddr{School of Engineering and Computing Sciences}\\
% \affaddr{Durham University}\\
% \affaddr{Durham DH1 3LE, UK}\\
% \and
% \alignauthor
% Peter~W.~Draper\\
% \affaddr{Institute for Computational Cosmology (ICC)}\\
% \affaddr{Department of Physics}\\
% \affaddr{Durham University}\\
% \affaddr{Durham DH1 3LE, UK}\\
% %% \alignauthor
% %% Tom Theuns\\
% %% \affaddr{Institute for Computational Cosmology}\\
% %% \affaddr{Department of Physics}\\
% %% \affaddr{Durham University}\\
% %% \affaddr{Durham DH1 3LE, UK}
}
@@ -180,7 +187,7 @@ The design and implementation of \swift\footnote{
\swift is an open-source software project and the latest version of
the source code, along with all the data needed to run the test cases
presented in this paper, can be downloaded at \web.}
\cite{gonnet2013swift,theuns2015swift,gonnet2015efficient}, a large-scale
\cite{gonnet2013swift,theuns2015swift,ref:Gonnet2015}, a large-scale
cosmological simulation code built from scratch, provided the perfect
opportunity to test some newer
approaches, i.e.~task-based parallelism, fully asynchronous communication, and
@@ -189,6 +196,9 @@ graph partition-based domain decompositions.
This paper describes these techniques, which are not exclusive to
cosmological simulations or any specific architecture, as well as
the results obtained with them.
While \cite{gonnet2013swift,ref:Gonnet2015} already describe the underlying algorithms
in more detail, in this paper we focus on the parallelization strategy
and the results obtained on larger Tier-0 systems.
%#####################################################################################################
@@ -261,24 +271,30 @@ separately:
within range of each other and evaluate \eqn{dvdt} and \eqn{dudt}.
\end{enumerate}
Finding the interacting neighbours for each particle constitutes
the bulk of the computation.
Many codes, e.g. in Astrophysics simulations \cite{Gingold1977},
rely on spatial {\em trees}
for neighbour finding \cite{Gingold1977,Hernquist1989,Springel2005,Wadsley2004},
i.e.~$k$-d trees \cite{Bentley1975} or octrees \cite{Meagher1982}
are used to decompose the simulation space.
In Astrophysics in particular, spatial trees are also a somewhat natural
choice as they are used to compute long-range gravitational interactions
via a Barnes-Hut \cite{Barnes1986} or Fast Multipole
\cite{Carrier1988} method.
The particle interactions are then computed by traversing the list of
particles and searching for their neighbours in the tree.
Finding the interacting neighbours for each particle constitutes the
bulk of the computation. Many codes, e.g. in Astrophysics simulations
\cite{Gingold1977}, rely on spatial {\em trees} for neighbour finding
\cite{Gingold1977,Hernquist1989,Springel2005,Wadsley2004}, i.e.~$k$-d
trees \cite{Bentley1975} or octrees \cite{Meagher1982} are used to
decompose the simulation space. In Astrophysics in particular,
spatial trees are also a somewhat natural choice as they are used to
compute long-range gravitational interactions via a Barnes-Hut
\cite{Barnes1986} or Fast Multipole \cite{Carrier1988} method. The
particle interactions are then computed by traversing the list of
particles and searching for their neighbours in the tree. In current
state-of-the-art simulations (e.g. \cite{Schaye2015}), the gravity
calculation corresponds to roughly one quarter of the calculation time
whilst the hydrodynamics scheme takes approximately half of the
total time. The remainder is spent in the astrophysics modules, which
contain interactions of the same kind as the SPH sums presented
here. Gravity could, however, be vastly sped up compared to commonly
used software by employing more modern algorithms such as the Fast
Multipole Method \cite{Carrier1988}.
Although such tree traversals are trivial to parallelize, they
have several disadvantages, e.g.~with regards to computational
efficiency, cache efficiency, and exploiting symmetries in the
computation (see \cite{gonnet2015efficient} for a more detailed
computation (see \cite{ref:Gonnet2015} for a more detailed
analysis).
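
As a minimal illustration of this symmetry argument, and of the cell-based alternative described below, the following C sketch accumulates the density contributions for a pair of cells. The data layout and kernel are purely illustrative assumptions, not code taken from \swift, and particle masses and kernel normalisation are omitted.

\begin{verbatim}
/* Minimal sketch (not SWIFT's actual code) of a
 * symmetric cell-pair density computation: each
 * particle pair within range is visited once and
 * contributes to the sums of both particles. */
#include <math.h>
#include <stddef.h>

struct part {
  double x[3]; /* position */
  double h;    /* smoothing length */
  double rho;  /* density accumulator */
};

/* Placeholder kernel, not the one used in SWIFT. */
static double kernel_w(double r, double h) {
  const double q = r / h;
  return q < 1.0 ? (1. - q) * (1. - q) / (h * h * h)
                 : 0.0;
}

void pair_density(struct part *ci, size_t ni,
                  struct part *cj, size_t nj) {
  for (size_t i = 0; i < ni; i++)
    for (size_t j = 0; j < nj; j++) {
      const double dx = ci[i].x[0] - cj[j].x[0];
      const double dy = ci[i].x[1] - cj[j].x[1];
      const double dz = ci[i].x[2] - cj[j].x[2];
      const double r =
          sqrt(dx * dx + dy * dy + dz * dz);
      /* One distance evaluation updates both
       * particles, a symmetry that a plain tree
       * walk cannot easily exploit. */
      if (r < ci[i].h)
        ci[i].rho += kernel_w(r, ci[i].h);
      if (r < cj[j].h)
        cj[j].rho += kernel_w(r, cj[j].h);
    }
}
\end{verbatim}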
@@ -337,7 +353,8 @@ Task-based parallelism is not a particularly new concept and therefore
several implementations thereof exist, e.g.~Cilk \cite{ref:Blumofe1995},
QUARK \cite{ref:QUARK}, StarPU \cite{ref:Augonnet2011},
SMP~Superscalar \cite{ref:SMPSuperscalar}, OpenMP~3.0 \cite{ref:Duran2009},
and Intel's TBB \cite{ref:Reinders2007}.
Intel's TBB \cite{ref:Reinders2007}, and, to some extent,
Charm++ \cite{ref:Kale1993}.
For convenience, and to make experimenting with different scheduling
techniques easier, we chose to implement our own task scheduler
@@ -357,7 +374,7 @@ which is usually not an option for large and complex codebases.
Since we were re-implementing \swift from scratch, this was not an issue.
The tree-based neighbour-finding described above was replaced with a more
task-friendly approach as described in \cite{gonnet2015efficient}.
task-friendly approach as described in \cite{ref:Gonnet2015}.
Particle interactions are computed within, and between pairs of,
hierarchical {\em cells} containing one or more particles.
The dependencies between the tasks are set following
@@ -375,24 +392,30 @@ on a cell of particles have completed before the force evaluation tasks
Once all the force tasks on a cell of particles have completed,
the integrator tasks (inverted triangles) update the particle positions
and velocities.
The decomposition was computed such that each cell contains $\sim 100$ particles,
which leads to tasks of up to a few milliseconds each.
Due to the cache-friendly nature of the task-based computations,
and their ability to exploit symmetries in the particle interactions,
the task-based approach is already more efficient than the tree-based
neighbour search on a single core, and scales efficiently to all
cores of a shared-memory machine \cite{gonnet2015efficient}.
cores of a shared-memory machine \cite{ref:Gonnet2015}.
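
The bookkeeping required to express these dependencies can be kept very small. The following C sketch shows the essential ingredients, namely a per-task counter of unresolved dependencies and a list of tasks to unlock upon completion; the structure and names are illustrative and do not reproduce the actual \swift scheduler.

\begin{verbatim}
/* Hedged sketch of task/dependency bookkeeping for
 * a task-based scheduler; names are illustrative,
 * not SWIFT's actual data structures. */
#include <stdatomic.h>

struct cell; /* opaque handle to a cell of particles */

enum task_type { task_density, task_force,
                 task_integrate, task_send, task_recv };

struct task {
  enum task_type type;
  struct cell *ci, *cj;  /* cell(s) acted on */
  atomic_int wait;       /* unfinished dependencies */
  struct task **unlocks; /* tasks depending on this one */
  int nr_unlocks;
};

/* Declare that task b depends on task a, i.e. a
 * unlocks b. Assumes a->unlocks is large enough. */
void task_add_unlock(struct task *a, struct task *b) {
  a->unlocks[a->nr_unlocks++] = b;
  atomic_fetch_add(&b->wait, 1);
}

/* On completion, decrement the wait counters of all
 * dependents; any that reach zero become runnable. */
void task_done(struct task *t,
               void (*enqueue)(struct task *)) {
  for (int k = 0; k < t->nr_unlocks; k++)
    if (atomic_fetch_sub(&t->unlocks[k]->wait, 1) == 1)
      enqueue(t->unlocks[k]);
}
\end{verbatim}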
\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/Hierarchy3}
\caption{Task hierarchy for the SPH computations in \swift,
including communication tasks. Arrows indicate dependencies,
including communication tasks.
Arrows indicate dependencies,
i.e.~an arrow from task $A$ to task $B$ indicates that $A$
depends on $B$. The task color corresponds to the cell or
depends on $B$.
The task color corresponds to the cell or
cells it operates on, e.g.~the density and force tasks work
on individual cells or pairs of cells.
The blue cell data is on a separate rank as the yellow and
purple cells, and thus its data must be sent across during
The task hierarchy is shown for three cells: blue, yellow,
and purple. Although the data for the yellow cell resides on
Node~2, it is required for some tasks on Node~1, and thus needs
to be copied over during
the computation using {\tt send}/{\tt recv} tasks (diamond-shaped).}
\label{tasks}
\end{figure}
@@ -432,8 +455,13 @@ corresponds to minimizing the time spent on the slowest rank.
Computing such a partition is a standard graph problem and several
software libraries which provide good solutions\footnote{Computing
the optimal partition for more than two nodes is considered NP-hard.},
e.g.~METIS \cite{ref:Karypis1998} and Zoltan \cite{devine2002zoltan},
exist.
e.g.~METIS \cite{ref:Karypis1998} and Zoltan \cite{devine2002zoltan}, exist.
In \swift, the graph partitioning is computed using the METIS library.
The cost of each task is initially approximated via the
asymptotic cost of the task type and the number of particles involved.
After a task has been executed, its effective computational cost
is measured and used instead.
Note that this approach does not explicitly consider any geometric
constraints, or strive to partition the {\em amount} of data equitably.
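
As an illustration of the partitioning step itself, the following sketch shows one possible way of handing the task costs to METIS. Only the {\tt METIS\_SetDefaultOptions} and {\tt METIS\_PartGraphKway} calls are part of the METIS API; the wrapper and array names are hypothetical.

\begin{verbatim}
/* Hedged sketch: partition the cell graph with METIS.
 * Vertices are cells, weighted by the (estimated or
 * measured) cost of their tasks; edges are pair tasks,
 * weighted by their cost. */
#include <metis.h>
#include <stddef.h>

int partition_cells(idx_t nvtxs, idx_t *xadj,
                    idx_t *adjncy, idx_t *vwgt,
                    idx_t *adjwgt, idx_t nparts,
                    idx_t *part /* out: rank per cell */) {
  idx_t ncon = 1;    /* one constraint: task cost */
  idx_t edgecut = 0; /* weight of the edges cut */
  idx_t options[METIS_NOPTIONS];
  METIS_SetDefaultOptions(options);

  /* xadj/adjncy is the CSR adjacency of the graph. */
  return METIS_PartGraphKway(&nvtxs, &ncon, xadj,
                             adjncy, vwgt,
                             /*vsize=*/NULL, adjwgt,
                             &nparts, /*tpwgts=*/NULL,
                             /*ubvec=*/NULL, options,
                             &edgecut, part);
}
\end{verbatim}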
@@ -567,6 +595,11 @@ conditions are provided with the source code. We then run the \swift code for
100 time-steps and average the wall clock time of these time-steps after having
removed the first and last ones, where disk I/O occurs.
The initial load balancing between nodes is computed in the first steps and
re-computed every 100 time-steps, and is therefore not included in the timings.
Particles were exchanged between nodes whenever they strayed too far beyond
the cells in which they originally resided.
\begin{figure}
\centering
\includegraphics[width=\columnwidth]{Figures/cosmoVolume}
@@ -630,8 +663,9 @@ the 16 node run.
is achieved when increasing the thread count from 1 (1 node) to 128 (8 nodes)
even on this relatively small test case. The dashed line indicates the
efficiency when running on one single node but using all the physical and
virtual cores (hyper-threading). As these CPUs only have one FPU per core, we
see no benefit from hyper-threading.
virtual cores (hyper-threading). As these CPUs have only one FPU per core, the
benefit from hyper-threading is limited to a 20\% improvement when going
from 16 cores to 32 hyper-threads.
\label{fig:cosma}}
\end{figure*}
@@ -641,7 +675,8 @@
For our next test, we ran \swift on the SuperMUC x86 phase~1 thin
nodes \footnote{\url{https://www.lrz.de/services/compute/supermuc/systemdescription/}}
located at the Leibniz Supercomputing Center in Garching near Munich. This
located at the Leibniz Supercomputing Center in Garching near Munich,
currently ranked 23rd in the Top500 list\footnote{\url{http://www.top500.org/list/2015/11/}}. This
system consists of 9\,216 nodes with 2 Intel Sandy Bridge-EP Xeon E5-2680
8C\footnote{\url{http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-(20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI)}}
at $2.7~\rm{GHz}$ CPUs. Each 16-core node has $32~\rm{GByte}$ of RAM.
@@ -681,7 +716,9 @@ threads per node, i.e.~one thread per physical core.
For our last set of tests, we ran \swift on the JUQUEEN IBM Blue Gene/Q
system\footnote{\url{http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/Configuration/Configuration_node.html}}
located at the J\"ulich Supercomputing Center. This system consists of
located at the J\"ulich Supercomputing Center,
currently ranked 11th in the Top500 list.
This system consists of
28\,672 nodes with an IBM PowerPC A2 processor running at
$1.6~\rm{GHz}$ and $16~\rm{GByte}$ of RAM each. Of notable interest
is the presence of two floating units per compute core. The system is
@@ -764,8 +801,9 @@ course of a time-step. For example, on the SuperMUC machine with 32 nodes (512
cores), each MPI rank contains approximately $1.6\times10^7$ particles in
$2.5\times10^5$ cells. \swift will generate around $58\,000$ point-to-point
asynchronous MPI communications (a pair of \texttt{send} and \texttt{recv} tasks)
{\em per node} and {\em per timestep}. Such an insane number of messages is
discouraged by most practitioners.
{\em per node} and {\em per timestep}. Each one of these communications involves,
on average, no more than 6\,kB of data. Such a large number of small messages is
discouraged by most practitioners, but seems to work well in practice.
Dispatching communications
over the course of the calculation and not in short bursts, as is commonly done,
may also help lower the load on the network.
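
The following C sketch illustrates how such a communication task can wrap a non-blocking MPI call that the scheduler then simply polls for completion; the data structure and function names are illustrative, not the actual \swift implementation.

\begin{verbatim}
/* Hedged sketch of a communication task wrapping a
 * non-blocking MPI call; names are illustrative. */
#include <mpi.h>

struct comm_task {
  void *buffer;   /* particle data of one cell */
  int count;      /* message size in bytes */
  int other_rank; /* rank we exchange with */
  int tag;        /* unique tag, e.g. from the cell ID */
  MPI_Request request;
};

/* Post the message as soon as the task becomes
 * ready, at any point during the time-step. */
void comm_task_start(struct comm_task *t, int is_send) {
  if (is_send)
    MPI_Isend(t->buffer, t->count, MPI_BYTE,
              t->other_rank, t->tag, MPI_COMM_WORLD,
              &t->request);
  else
    MPI_Irecv(t->buffer, t->count, MPI_BYTE,
              t->other_rank, t->tag, MPI_COMM_WORLD,
              &t->request);
}

/* Polled by the scheduler; once the message has
 * completed, dependent tasks can be unlocked. */
int comm_task_test(struct comm_task *t) {
  int done = 0;
  MPI_Test(&t->request, &done, MPI_STATUS_IGNORE);
  return done;
}
\end{verbatim}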
@@ -791,6 +829,22 @@ ability combined with future work on vectorization of the calculations within
each task will hopefully make \swift an important tool for future simulations in
cosmology and help push the entire field to a new level.
These results, which are in no way restricted to astrophysical simulations,
provide a compelling argument for moving away from the traditional
branch-and-bound paradigm of both shared and distributed memory programming
using synchronous MPI and OpenMP.
Although fully asynchronous methods, due to their somewhat anarchic nature,
may seem more difficult to control, they are conceptually simple and easy
to implement\footnote{The \swift source code consists of less than 21\,K lines
of code, of which only roughly one tenth are needed to implement the task-based
parallelism and asynchronous communication.}.
The real cost of using a task-based approach comes from having
to rethink the entire computation to fit the task-based setting. This may,
as in our specific case, lead to completely rethinking the underlying
algorithms and reimplementing a code from scratch.
Given our results to date, however, we do not believe we will regret this
investment.
\swift, its documentation, and the test cases presented in this paper are all
available at the address \web.