Commit 1de43e5a authored by Pedro Gonnet's avatar Pedro Gonnet
final corrections.

@@ -193,7 +193,7 @@ This paper presents QuickSched, a framework for task-based
parallel programming with constraints, which aims to achieve
the following goals:
\begin{itemize}
\item {\em Correctness}: All constraints, i.e.~dependencies and
conflicts, must be correctly enforced,
\item {\em Speed}: The overheads associated with task management
should be as small as possible,
@@ -271,7 +271,7 @@ and thus implicitly all their spawned tasks, before executing
$E$ and $K$.
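This spawn-and-wait pattern can be sketched with OpenMP tasks; the task names and work functions below are illustrative stand-ins, not the exact graph of the figure:

```c
#include <assert.h>

static int done_B, done_C, done_E;

/* Stand-ins for the real work of tasks B, C and E. */
static void run_B(void) { done_B = 1; }
static void run_C(void) { done_C = 1; }
static void run_E(void) { done_E = done_B && done_C; } /* E needs B and C */

void spawn_and_wait(void) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task      /* spawn B; B may spawn further tasks */
        run_B();
        #pragma omp task      /* spawn C */
        run_C();
        /* Wait for B and C -- and implicitly everything they
           spawned -- before running the dependent task E. */
        #pragma omp taskwait
        run_E();
    }
}
```

Note that the wait is coarse: it blocks on all spawned children, not only on the tasks E actually depends on, which is exactly the loss of concurrency discussed above.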
\begin{figure}
\centerline{\epsfig{file=figures/Spawn.pdf,width=0.9\textwidth}}
\caption{Two different task graphs and how they can be implemented
using spawning and waiting.
For the task graph on the left, each task spawns its dependent
@@ -358,7 +358,7 @@ decomposition is too coarse, then good parallelism
and load-balancing will be difficult to achieve.
Conversely, if the tasks are too small, the costs of selecting and
scheduling tasks, which are usually constant per task, will
quickly destroy any performance gains from parallelism.
Starting from a per-statement set of tasks, it is therefore
reasonable to group them by their dependencies and shared resources.
@@ -388,7 +388,7 @@ how the work is done, i.e.~which tasks get scheduled
where and when, respectively.
\begin{figure}
\centerline{\epsfig{file=figures/QSched.pdf,width=0.8\textwidth}}
\caption{Schematic of the QuickSched task scheduler.
The tasks (circles) are stored in the scheduler (left).
Once a task's dependencies have been resolved, the task
@@ -547,7 +547,7 @@ Likewise, if a resource is locked, it cannot be held
(see \fig{Resources}).
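The lock/hold relationship can be sketched as follows. The struct layout and function names are illustrative assumptions, not QuickSched's actual interface, and a real implementation would manipulate the flags and counters atomically:

```c
#include <assert.h>
#include <stddef.h>

struct res {
    struct res *parent;   /* NULL at the root of the hierarchy */
    int locked;           /* resource is exclusively locked */
    int hold;             /* number of locked descendants */
};

/* Try to lock r: fails if r is locked or held, or if any ancestor
   is already locked. Locking r marks all its ancestors as held,
   so they in turn cannot be locked. Single-threaded sketch; the
   real scheduler would use atomic compare-and-swap here. */
int res_trylock(struct res *r) {
    if (r->locked || r->hold > 0) return 0;
    for (struct res *p = r->parent; p != NULL; p = p->parent)
        if (p->locked) return 0;
    r->locked = 1;
    for (struct res *p = r->parent; p != NULL; p = p->parent)
        p->hold += 1;
    return 1;
}

void res_unlock(struct res *r) {
    r->locked = 0;
    for (struct res *p = r->parent; p != NULL; p = p->parent)
        p->hold -= 1;
}
```

Two sibling cells can thus be locked concurrently, while locking any cell excludes locking the whole path up to the root.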
\begin{figure}
\centerline{\epsfig{file=figures/Resources.pdf,width=0.7\textwidth}}
\caption{A hierarchy of cells (left) and the hierarchy of
corresponding hierarchical resources at each level.
Each square on the right represents a single resource, and
@@ -737,7 +737,7 @@ two tasks attempt, simultaneously, to lock the resources $A$ and $B$;
and $B$ and $A$, respectively, via separate queues, their respective calls
to {\tt queue\_get} will potentially fail perpetually.
This type of deadlock, however, is easily avoided by sorting the
resources in each task according to some global criterion, e.g.~the
resource ID or the memory address of the resource.
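A minimal sketch of this ordering rule, assuming resources are ordered by their memory address (the function names are illustrative, not part of QuickSched's API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two resource pointers by their memory address, a
   global criterion every task agrees on. */
static int cmp_res(const void *a, const void *b) {
    uintptr_t ra = (uintptr_t)*(void *const *)a;
    uintptr_t rb = (uintptr_t)*(void *const *)b;
    return (ra > rb) - (ra < rb);
}

/* Sort a task's resources once, at task creation, so that all
   tasks acquire shared resources in the same global order and
   no cyclic lock-wait can arise. */
void task_sort_resources(void **res, size_t nr_res) {
    qsort(res, nr_res, sizeof(void *), cmp_res);
}
```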
\subsection{Scheduler}
@@ -918,7 +918,7 @@ designed for this specific task, while the latter currently uses
the StarPU task scheduler \cite{ref:Agullo2011}.
\begin{figure}
\centerline{\epsfig{file=figures/QR.pdf,width=0.9\textwidth}}
\caption{Task-based QR decomposition of a matrix consisting
of $4\times 4$ tiles.
Each circle represents a tile, and its color represents
@@ -958,6 +958,11 @@ previous level, i.e.~the task $(i,j,k)$ always depends on
$(i,j,k-1)$ for $k>1$.
Each task also modifies its own tile $(i,j)$, and the DTSQRF
task additionally modifies the lower triangular part of the $(j,j)$th tile.
Although the tile-based QR decomposition requires only dependencies,
i.e.~no additional conflicts are needed to avoid concurrent access to
the matrix tiles, we still model each tile as a separate resource
in QuickSched such that the scheduler can preferentially assign
tasks using the same tiles to the same thread.
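The vertical dependency structure, i.e.~task $(i,j,k)$ depending on $(i,j,k-1)$, can be enumerated as follows. Collecting the edges into an array is purely illustrative; a real setup would pass each edge to the scheduler's dependency-registration call instead:

```c
#include <assert.h>
#include <stddef.h>

#define M 4  /* 4x4 tiles, as in the example above */

struct edge { int i, j, k; };   /* (i,j,k) depends on (i,j,k-1) */

/* Collect the vertical dependency edges of an M x M tiled QR
   sweep; returns the number of edges written. In this simplified
   indexing, a task (i,j,k) exists for k = 1..min(i,j), so edges
   exist for k = 2..min(i,j). */
size_t qr_vertical_deps(struct edge *out, size_t max) {
    size_t n = 0;
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= M; j++) {
            int kmax = i < j ? i : j;
            for (int k = 2; k <= kmax && n < max; k++)
                out[n++] = (struct edge){ i, j, k };
        }
    return n;
}
```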
The QR decomposition was computed for a $2048\times 2048$
random matrix with tiles of size $64\times 64$ floats using QuickSched
@@ -980,7 +985,7 @@ calling the kernels directly using {\tt \#pragma omp task}
annotations with the respective dependencies, and
the runtime parameters
\begin{quote}
\tt --disable-yield --schedule=socket --cores-per-socket=16 \\--num-sockets=4
\end{quote}
\noindent The scaling and efficiency relative to QuickSched are
shown in \fig{QRResults}.
@@ -1002,7 +1007,7 @@ OmpSs, does not exploit this knowledge, resulting in the less efficient
scheduling seen in \fig{QRTasks}.
\begin{figure}
\centerline{\epsfig{file=figures/QR_scaling.pdf,width=\textwidth}}
\caption{Strong scaling and parallel efficiency of the tiled QR decomposition
computed over a $2048\times 2048$ matrix with tiles of size
$64\times 64$.
@@ -1014,8 +1019,8 @@ scheduling seen in \fig{QRTasks}.
\end{figure}
\begin{figure}
\centerline{\epsfig{file=figures/tasks_qr.pdf,width=\textwidth}}
\centerline{\epsfig{file=figures/tasks_qr_ompss.pdf,width=\textwidth}}
\caption{Task scheduling in QuickSched (above) and OmpSs (below)
for a $2048\times 2048$ matrix on 64 cores.
The task colors correspond to those in \fig{QR}.}
@@ -1025,7 +1030,8 @@ scheduling seen in \fig{QRTasks}.
\subsection{Task-Based Barnes-Hut N-Body Solver}
The Barnes-Hut tree-code \cite{ref:Barnes1986}
is an algorithm to approximate the
solution of an $N$-body problem, i.e.~computing all the
pairwise interactions between a set of $N$ particles,
in \oh{N\log N} operations, as opposed to the \oh{N^2}
@@ -1188,7 +1194,7 @@ due to the better strong scaling of the task-based approach as opposed
to the MPI-based parallelism in Gadget-2.
\begin{figure}
\centerline{\epsfig{file=figures/BH_scaling.pdf,width=\textwidth}}
\caption{Strong scaling and parallel efficiency of the Barnes-Hut tree-code
computed over 1\,000\,000 particles.
Solving the N-Body problem takes 323\,ms, achieving 75\% parallel
@@ -1203,7 +1209,7 @@ to the MPI-based parallelism in Gadget-2.
\end{figure}
\begin{figure}
\centerline{\epsfig{file=figures/tasks_bh_dynamic_64.pdf,width=\textwidth}}
\caption{Task scheduling of the Barnes-Hut tree-code on 64 cores.
The red tasks correspond to particle self-interactions, the green
tasks to the particle-particle pair interactions, and the blue
@@ -1223,7 +1229,7 @@ At 64 cores, the scheduler overheads account for only $\sim 1$\% of
the total computational cost, whereas,
from 32 cores onwards, the cost of both pair task types grows by up to
40\%.
This is due to memory bandwidth restrictions, as
the cost of the particle-cell interaction tasks, which do significantly more
computation per memory access, grows by only up to 10\%.
@article{ref:Barnes1986,
title={A hierarchical {O(N log N)} force-calculation algorithm},
author={Barnes, Josh and Hut, Piet},
journal={Nature},
year={1986},
publisher={Nature Publishing Group}
}
@book{ref:Snir1998,
title={{MPI}: The Complete Reference, Volume 1: The {MPI} Core},
author={Snir, Marc and Otto, Steve and Huss-Lederman, Steven and Walker, David and Dongarra, Jack},