From 82026887fe9fc8ab3d74dadfb1f06f20bd7c3804 Mon Sep 17 00:00:00 2001
From: Pedro Gonnet <gonnet@google.com>
Date: Tue, 9 Jun 2015 21:56:41 +0200
Subject: [PATCH] started porting to PeerJ template.

---
 paper/paper.tex | 140 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 91 insertions(+), 49 deletions(-)

diff --git a/paper/paper.tex b/paper/paper.tex
index 03fdc39..7cdf9c3 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -1,4 +1,4 @@
-\documentclass[preprint]{elsarticle}
+\documentclass[fleqn,10pt]{wlpeerj}

 % Package to generate and customize Algorithm as per ACM style
 \usepackage[ruled]{algorithm2e}
@@ -47,18 +47,16 @@ {\textbf{\textsf{#1}}}
-% Document starts
-\begin{document}
-
 % Title portion
 \title{QuickSched: Task-based parallelism with dependencies and conflicts}
-\author[ecs,ggl]{Pedro Gonnet\corref{cor}} \ead{pedro.gonnet@durham.ac.uk}
-\author[ecs]{Aidan B. G. Chalk} \ead{aidan.chalk@durham.ac.uk}
-\author[icc]{Matthieu Schaller} \ead{matthieu.schaller@durham.ac.uk}
-\cortext[cor]{Corresponding author}
-\address[ecs]{School of Engineering and Computing Sciences, Durham University, South Road, Durham DH1 3LE, United Kingdom.}
-\address[icc]{Institute for Computational Cosmology, Durham University, South Road, Durham DH1 3LE, United Kingdom.}
-\address[ggl]{Google Switzerland GmbH, Brandshenkestr. 110, 8002 Z\"urich, Switzerland.}
+\author[1,3]{Pedro Gonnet}
+\author[1]{Aidan B. G. Chalk}
+\author[2]{Matthieu Schaller}
+\affil[1]{School of Engineering and Computing Sciences, Durham University, United Kingdom.}
+\affil[2]{Institute for Computational Cosmology, Durham University, United Kingdom.}
+\affil[3]{Google Switzerland GmbH, Z\"urich, Switzerland.}
+
+\keywords{task-based parallelism, shared-memory parallelism, high performance computing, parallel computing}

 \begin{abstract}
 This paper describes QuickSched, a compact and efficient Open-Source
@@ -73,11 +71,12 @@ shared-memory machine for two example problems: A tiled QR decomposition
 and a task-based Barnes-Hut tree code.
 \end{abstract}
-\begin{keyword}
- task-based parallelism \sep shared-memory parallelism \sep high performance computing \sep parallel computing
-\end{keyword}
+% Document starts
+\begin{document}
+\flushbottom
 \maketitle
+\thispagestyle{empty}

 \section{Introduction}
@@ -95,13 +94,20 @@ respectively, of a Directed Acyclic Graph (DAG) which can be
 traversed in topological order, executing the tasks at the
 nodes on the way down.
-This computational model is trivial to parallelize.
-Given a set of inter-dependent tasks and a set of computational
+Once modelled in such a way, the computation is conceptually trivial
+to parallelize:
+given a set of inter-dependent tasks and a set of computational
 threads, each thread repeatedly selects a task with no unsatisfied
 dependencies from the DAG and executes it.
 If no tasks are available, the thread waits until any other thread
 finishes executing a task, thus potentially releasing new tasks, or
 until all tasks in the DAG have been executed.
+Note that although the parallel execution
+itself is trivial, it is not guaranteed to be efficient.
+Several factors may limit the maximum degree of parallelism, e.g.~the
+structure of the task dependency DAG itself, or the order in which
+available tasks are executed.
+
 \fig{Tasks} shows such a DAG for a set of tasks.
 The arrows indicate the direction of the dependency,
 i.e.~an arrow from task $A$ to task $B$ indicates that task $B$ depends
@@ -111,6 +117,7 @@ executed concurrently.
 Once task $G$ has completed, tasks $F$ and $H$ become available
 and can be executed by any other computational thread.
+
 \begin{figure}
 \centerline{\epsfig{file=figures/Tasks.pdf,width=0.5\textwidth}}
 \caption{A set of tasks (circles) and their dependencies (arrows).
@@ -126,15 +133,15 @@ and can be executed by any other computational thread.
 \end{figure}

 One of the first implementations of a task-based parallel programming
-systems is Cilk \cite{ref:Blumofe1995}, an extension to the C
+system is Cilk \citep{ref:Blumofe1995}, an extension to the C
 programming language which allows function calls to be ``spawned''
 as new tasks.
 Dependencies are enforced by the {\tt sync} keyword, which forces
 a thread to wait for all the tasks that it spawned to complete.

-In SMP superscalar \cite{ref:Perez2008}, StarPU \cite{ref:Augonnet2011},
-QUARK \cite{ref:Yarkhan2011}, and KAAPI \cite{ref:Gautier2007}
+In SMP superscalar \citep{ref:Perez2008}, StarPU \citep{ref:Augonnet2011},
+QUARK \citep{ref:Yarkhan2011}, and KAAPI \citep{ref:Gautier2007}
 the programmer specifies what shared data each task will access,
 and how that data will be accessed, e.g.~read, write, or read-write
 access.
@@ -145,7 +152,7 @@ the tasks are generated.
 StarPU also provides an interface for specifying additional
 dependencies explicitly.
 Intel's Threading Building Blocks (TBB)
-\cite{ref:Reinders2010}
+\citep{ref:Reinders2010}
 provide task-based parallelism using C++ templates
 in which dependencies are handled either by explicitly
 waiting for spawned tasks, or by explicitly manipulating
@@ -153,8 +160,8 @@ task reference counters.

 Finally, the very popular OpenMP standard provides some basic
 support for spawning tasks, similar to Cilk, as of version 3.0
-\cite{ref:OpenMP2008}.
-OmpSs \cite{ref:Duran2011} extends this scheme with automatic
+\citep{ref:OpenMP2008}.
+OmpSs \citep{ref:Duran2011} extends this scheme with automatic
 dependency generation as in SMP superscalar, of which it is a
 direct descendant, along with the ability to explicitly wait on
 certain tasks.
@@ -168,7 +175,7 @@ Consider the case of two tasks that update some shared resource
 in an order-independent way, e.g. when accumulating a result in a
 shared variable, or exclusively writing to an output file.
 In order to avoid concurrent access to that resource, it is
-imperative that the execution of both tasks do not overlap,
+imperative that the executions of the two tasks do not overlap,
 yet the order in which the tasks are executed is irrelevant.
 In the following, such a relationship will be referred to
 as a {\em conflict} between two tasks.
@@ -241,9 +248,9 @@ From a programmer's perspective, there are two main paradigms
 for generating task dependencies:
 \begin{itemize}
 \item Implicitly via spawning and waiting, e.g.~as is done in Cilk
- \citep{ref:Blumofe1995}, or
+ \citep{ref:Blumofe1995}, or
 \item Automatic extraction from data dependencies, e.g.~as is done in OmpSs
- \cite{ref:Duran2011}.
+ \citep{ref:Duran2011}.
 \end{itemize}

 The first scheme, spawning and waiting, is arguably the simplest to
@@ -371,13 +378,15 @@ and granularity such that
 \end{itemize}
 \noindent The first criterion is biased towards bigger tasks,
 while the second limits their size.
-The criteria are thus used to optimize the
-cache efficiency of the computation.
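+As an illustration of how these two criteria interact, the following
+sketch splits a computation recursively until each piece is small
+enough to be a single task (the function and constant names are
+hypothetical and not part of the QuickSched API):
+\begin{center}\begin{minipage}{0.8\textwidth}
+\begin{lstlisting}
+extern int n_task;                    /* upper bound on task size */
+extern void create_task(int first, int n);    /* hypothetical */
+
+/* Split a range of n elements into tasks of at most
+   n_task elements each (sketch; names hypothetical). */
+void make_tasks(int first, int n) {
+  if (n <= n_task) {
+    /* Large enough to offset the scheduling overheads. */
+    create_task(first, n);
+  } else {
+    /* Too large: split so that enough tasks are available
+       to keep all computational threads busy. */
+    make_tasks(first, n / 2);
+    make_tasks(first + n / 2, n - n / 2);
+  }
+}
+\end{lstlisting}
+\end{minipage}\end{center}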
+The parameters controlling the size of the tasks in the examples,
+i.e.~the tile size in the QR decomposition and the limits $n_\mathsf{max}$
+and $n_\mathsf{task}$, were determined empirically and optimized only
+to the nearest power of two or rough power of ten, respectively.

 \section{Data Structures and Algorithms}

-The QuickSched task scheduler consist of four main
+The QuickSched task scheduler consists of four main
 object types: {\em task}, {\em resource}, {\em scheduler},
 and {\em queue}.
 The task and resource objects are used
@@ -494,10 +503,18 @@ for the relative computational cost of this task, and the
 relative cost of the critical path following the dependencies
 of this task, respectively, i.e.~the task's cost plus the
 maximum dependent task's weight (see \fig{TaskWeight}).
+The task cost can be either a rough estimate provided by the user
+or the measured cost of the same task the last time it was executed.
 The task weights are computed by traversing
-the tasks in reverse topological order following their dependencies, i.e.
-as per \cite{ref:Kahn1962}, and computing each task's weight
-up the task DAG.
+the task DAG in reverse topological order following the dependencies,
+e.g.~as per \citet{ref:Kahn1962}, and computing each task's weight as
+\begin{equation*}
+  \mbox{weight}_i = \mbox{cost}_i + \max_{j \in \mbox{\small unlocks}_i}\left\{\mbox{weight}_j\right\},
+\end{equation*}
+\noindent where $\mbox{weight}_i$ and $\mbox{cost}_i$ are the
+weight and cost of the $i$th task, respectively, and
+$\mbox{unlocks}_i$ is the set of tasks that the $i$th task
+unlocks.

 \begin{figure}
 \centerline{\epsfig{file=figures/TaskWeight.pdf,height=0.4\textwidth}}
@@ -740,6 +757,15 @@ This type of deadlock, however, is easily avoided by sorting
 the resources in each task according to some global criteria,
 e.g.~the resource ID or the address in memory of the resource.
+Note also that protecting the entire queue with a mutex is not
+particularly scalable, and several authors, e.g.~REFS,
+have presented concurrent data structures that avoid this type
+of locking.
+However, since we normally use one queue per computational thread,
+contention only happens due to work-stealing, i.e.~when
+another idle computational thread tries to poach tasks.
+Since this happens only rarely, we opt for the simpler locking approach.
+
 \subsection{Scheduler}

 The scheduler object is used as the main interface to the
@@ -782,11 +808,11 @@ void qsched_run(qsched *s, void (*fun)(int, void *)) {
 \end{minipage}\end{center}
 \noindent where {\tt qsched\_start} initializes the tasks
 and fills the queues (line~1).
-For simplicity, OpenMP \cite{ref:Dagum1998}, which is available
+For simplicity, OpenMP \citep{ref:Dagum1998}, which is available
 for most compilers, is used to create a parallel section in which
 the code between lines~4 and~11 is executed concurrently.
-A version using {\tt pthreads} \cite{ref:Pthreads1995}
+A version using {\tt pthreads} \citep{ref:Pthreads1995}
 directly\footnote{In most environments, OpenMP is implemented
 on top of {\tt pthreads}, e.g. gcc's libgomp.} is also available.
 The parallel section consists of a loop (lines~7--10) in
@@ -864,7 +890,7 @@ i.e.~it loops over all other queues in a random order (line~6)
 and tries to get a task from them (line~7).
 If a task could be obtained from any queue and task re-owning
-is switched on (line~12),
+is enabled (line~12),
 the resources it locks and uses are marked as now being
 owned by the preferred queue (lines~13--16).
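+In essence, this re-owning amounts to re-tagging the stolen task's
+resources, e.g.~as in the following sketch (the field names are
+hypothetical and need not match the QuickSched source):
+\begin{center}\begin{minipage}{0.8\textwidth}
+\begin{lstlisting}
+/* Mark the resources locked or used by the stolen task t
+   as owned by the preferred queue qid (sketch). */
+for (int k = 0; k < t->nr_locks; k++)
+  t->locks[k]->owner = qid;
+for (int k = 0; k < t->nr_uses; k++)
+  t->uses[k]->owner = qid;
+\end{lstlisting}
+\end{minipage}\end{center}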
 Finally, the task, or {\tt NULL} if no task could be obtained,
@@ -886,7 +912,7 @@ how QuickSched can be used in real-world applications, and
 provides benchmarks to assess its efficiency and scalability.
 The first test is the tiled QR decomposition originally
 described in \cite{ref:Buttari2009}, which has been used as a benchmark
-by other authors \cite{ref:Agullo2009b,ref:Badia2009,ref:Bosilca2012}.
+by other authors \citep{ref:Agullo2009b,ref:Badia2009,ref:Bosilca2012}.
 This example only requires dependencies and is presented as
 a point of comparison to existing task-based parallel programming
 infrastructures.
@@ -912,10 +938,10 @@ parallelism for tile-based algorithms in numerical linear
 algebra, presenting parallel codes for the Cholesky, LU,
 and QR decompositions.
 These algorithms are now part of the PLASMA and MAGMA
-libraries for parallel linear algebra \cite{ref:Agullo2009}.
+libraries for parallel linear algebra \citep{ref:Agullo2009}.
 The former uses the QUARK task scheduler, which was originally
 designed for this specific task, while the latter currently uses
-the StarPU task scheduler \cite{ref:Agullo2011}.
+the StarPU task scheduler \citep{ref:Agullo2011}.

 \begin{figure}
 \centerline{\epsfig{file=figures/QR.pdf,width=0.9\textwidth}}
@@ -958,16 +984,22 @@ previous level, i.e.~the task $(i,j,k)$ always depends on
 $(i,j,k-1)$ for $k>1$.
 Each task also modifies its own tile $(i,j)$, and the DTSQRF task
 additionally modifies the lower triangular part of the $(j,j)$th tile.
+
 Although the tile-based QR decomposition requires only dependencies,
 i.e.~no additional conflicts are needed to avoid concurrent access
 to the matrix tiles, we still model each tile as a separate resource
 in QuickSched such that the scheduler can preferentially assign
 tasks using the same tiles to the same thread.
+The resources/tiles are initially assigned to the queues in column-major
+order, i.e.~the first $\lfloor n_\mathsf{tiles}/n_\mathsf{queues}\rfloor$ tiles
+are assigned to the first queue, and so on.

 The QR decomposition was computed for a $2048\times 2048$ random
 matrix using tiles of size $64\times 64$ floats using QuickSched
 as described above.
+The task costs were initialized to the asymptotic cost of the underlying
+operations.
-For this matrix, a total of 11440 tasks with 32240 dependencies
+For this matrix, a total of 11\,440 tasks with 32\,240 dependencies
 were generated.
 For these tests, {\tt pthread} parallelism and resource re-owning
@@ -978,16 +1010,21 @@ efficiency results in \fig{QRResults}.
 The timings are for {\tt qsched\_run}, including the cost of
 {\tt qsched\_start}, which does not run in parallel.
 Setting up the scheduler, tasks, and resources took, in all
-cases, an average of 7.2\,ms.
+cases, an average of 7.2\,ms, i.e.~at most 3\% of the total
+computational cost.

 The same decomposition was implemented using OmpSs v.\,1.99.0,
 calling the kernels directly using {\tt \#pragma omp task}
 annotations with the respective dependencies, and the runtime
 parameters
 \begin{quote}
- \tt --disable-yield --schedule=socket --cores-per-socket=16 \\--num-sockets=4
+ \tt --disable-yield --schedule=socket --cores-per-socket=16 \\--num-sockets=4
 \end{quote}
-\noindent The scaling and efficiency relative to QuickSched are
+\noindent Several different schedulers and parameterizations
+were discussed with the authors of OmpSs and tested, with
+the above settings producing the best results.
+
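+For illustration, the OmpSs version annotates each kernel call with
+its data dependencies, roughly as follows (a sketch; the tile layout,
+sizes, and kernel name are hypothetical):
+\begin{center}\begin{minipage}{0.8\textwidth}
+\begin{lstlisting}
+enum { NT = 32, TS = 64 };  /* tile grid and tile size */
+float A[NT][NT][TS * TS];   /* NT x NT tiles of TS x TS floats */
+
+/* Inside the loops over k, i, and j: OmpSs derives the task
+   dependencies between kernel calls from the in/inout clauses. */
+#pragma omp task in(A[k][k]) inout(A[i][j])
+kernel(A[k][k], A[i][j]);
+\end{lstlisting}
+\end{minipage}\end{center}
+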
+The scaling and efficiency relative to QuickSched are
 shown in \fig{QRResults}.
 The difference in timings is the result of the different
 task scheduling policies, as well as a smaller lag between the
@@ -1030,7 +1067,7 @@ scheduling seen in \fig{QRTasks}.

 \subsection{Task-Based Barnes-Hut N-Body Solver}

-The Barnes-Hut tree-code \cite{ref:Barnes1986}
+The Barnes-Hut tree-code \citep{ref:Barnes1986}
 is an algorithm to approximate the solution of an $N$-body
 problem, i.e.~computing all the pairwise interactions between a set
 of $N$ particles,
@@ -1070,7 +1107,7 @@ The current approach, illustrated in \fig{CellParts} is not only more
 compact, it also allows a direct and more cache-efficient access to
 the list of particles for any inner node of the tree.
 The cost of sorting the particles, with a recursive
-partitioning similar to QuickSort \cite{ref:Hoare1962},
+partitioning similar to QuickSort \citep{ref:Hoare1962},
 is in \oh{N\log N}.

 \begin{figure}
@@ -1163,6 +1200,11 @@ Using the above scheme generated 97\,553 tasks, of which 512
 self-interaction tasks, 5\,068 particle-particle interaction tasks,
 and 32\,768 particle-cell interaction tasks.
 A total of 43\,416 locks on 37\,449 resources were generated.
+Setting up the tasks took, on average, XXX\,ms, i.e.~at most
+XXX\% of the total computation time.
+Storing the tasks, resources, and dependencies required XXX\,MB,
+which is only XX\% of the XXX\,MB required to store the particle
+data.
 For these tests, {\tt pthread} parallelism was used and resource
 re-owning was switched off.
@@ -1179,10 +1221,10 @@ Setting up the scheduler, tasks, and resources took, in all
 cases, an average of 51.3\,ms.

 For comparison, the same computations were run using the popular
-astrophysics simulation software Gadget-2 \cite{ref:Springel2005},
+astrophysics simulation software Gadget-2 \citep{ref:Springel2005},
 using a traditional Barnes-Hut implementation based on octrees
 and distributed-memory parallelism based on domain decompositions
-and MPI \cite{ref:Snir1998}.
+and MPI \citep{ref:Snir1998}.
 To achieve the same accuracy, an opening angle of 0.5 was used.
 On a single core, the task-based tree traversal is already 1.9$\times$
 faster than Gadget-2, due to the cache efficiency of the task-based
@@ -1339,7 +1381,7 @@ This work was supported by a Durham University Seedcorn Grant.

 % Bibliography
-\bibliographystyle{elsarticle-num}
+% \bibliographystyle{elsarticle-num}
 \bibliography{quicksched}
@@ -1529,7 +1571,7 @@ Similarly, {\tt rid} stores the handles of the resources for
 each tile of the matrix, which are allocated in line~8.
 The following loops mirror the task generation described in
-Algorithm~2 of \cite{ref:Buttari2009}.
+Algorithm~2 of \citet{ref:Buttari2009}.
 For each level {\tt k} (line~10), a DGEQRF task is created
 for tile $(k,k)$ (lines~13--14).
 A lock is added for the newly created task on the
@@ -1584,7 +1626,7 @@ struct cell {
   struct cell *progeny[8];
   qsched_res_t res;
   qsched_task_t task_com;
-  };
+};
 \end{lstlisting}
 \end{minipage}\end{center}
 \noindent where {\tt loc} and {\tt h} are the location
@@ -1606,7 +1648,7 @@ data of the form:
 struct part {
   double x[3], a[3], mass;
   int id;
-  };
+};
 \end{lstlisting}
 \end{minipage}\end{center}
 \noindent i.e.~the particle position, acceleration, mass,
--
GitLab