SWIFT / SWIFTsim · Commits · e92a09c7

Commit e92a09c7, authored Jan 22, 2016 by Pedro Gonnet
parent 7d744337

spellcheck, defaulted to US.

Changes: 1 file
--- a/theory/paper_pasc/pasc_paper.tex
+++ b/theory/paper_pasc/pasc_paper.tex
@@ -112,7 +112,7 @@ strong scaling on more than 100\,000 cores.}
 \item
 \textbf{Fully dynamic and asynchronous communication},
 in which communication is modelled as just another task in
 the task-based scheme, sending data whenever it is ready and
-deferrin on tasks that rely on data from other nodes
+deferring on tasks that rely on data from other nodes
 until it arrives.
 \end{itemize}
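
The item above describes communication modelled as just another task: data is posted as soon as it is ready, and any task that relies on data from another node is simply not run until the matching transfer has completed. The following C fragment is a minimal sketch of that pattern using plain non-blocking MPI; it is not SWIFT's actual implementation, and the cell structure and the enqueue_dependent_tasks hook are hypothetical placeholders.

    #include <mpi.h>

    /* Hypothetical cell payload; in SWIFT this would be particle data. */
    struct cell {
      double *parts;    /* particle buffer */
      int count;        /* number of entries in parts */
      MPI_Request req;  /* outstanding non-blocking request */
    };

    /* "Send" task: post the data as soon as it is ready and return. */
    void task_send_cell(struct cell *c, int dest, int tag) {
      MPI_Isend(c->parts, c->count, MPI_DOUBLE, dest, tag,
                MPI_COMM_WORLD, &c->req);
    }

    /* "Recv" task: post the matching receive without blocking. */
    void task_recv_cell(struct cell *c, int source, int tag) {
      MPI_Irecv(c->parts, c->count, MPI_DOUBLE, source, tag,
                MPI_COMM_WORLD, &c->req);
    }

    /* Polled by the scheduler until the transfer first completes; at
     * that point the tasks that depend on this cell's remote data
     * become runnable.  enqueue_dependent_tasks() is a hypothetical
     * hook for that step. */
    void task_check_comm(struct cell *c,
                         void (*enqueue_dependent_tasks)(struct cell *)) {
      int done = 0;
      MPI_Test(&c->req, &done, MPI_STATUS_IGNORE);
      if (done) enqueue_dependent_tasks(c);
    }

A completed receive then simply unlocks its dependent tasks, behaving like any other dependency in the task graph rather than a blocking wait.
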
@@ -132,7 +132,7 @@ strong scaling on more than 100\,000 cores.}
 %% scaling on a variety of architectures, ranging from x86 Tier-2 systems to the
 %% largest Tier-0 machines currently available. It displays, for instance, a
 %% \emph{strong} scaling parallel efficiency of more than 60\% when going from
-%% 512 to 131072 cores on a BlueGene architecture. Similar results are obtained
+%% 512 to 131072 cores on a Blue Gene architecture. Similar results are obtained
 %% on standard clusters of x86 CPUs.
 %% The task-based library, \qs, used as the backbone of the code is
@@ -284,14 +284,14 @@ analysis).
 %#####################################################################################################
-\section{Parallelisation strategy}
+\section{Parallelization strategy}
 One of the main concerns when developing \swift was to break
 with the branch-and-bound type parallelism inherent to parallel
-codes using OpenMP and MPI, and the constant synchronisation
+codes using OpenMP and MPI, and the constant synchronization
 between computational steps it results in.
-If {\em synchronisation} is the main problem, then {\em
+If {\em synchronization} is the main problem, then {\em
 asynchronicity} is the obvious solution. We therefore opted for a
 {\em task-based} approach for maximum single-node, or shared-memory,
 performance. This approach not only provides excellent load-balancing
@@ -327,7 +327,7 @@ The main advantages of using a task-based approach are
 are assigned to each processor is completely
 dynamic and adapts automatically to load imbalances.
 \item
 If the dependencies and conflicts are specified correctly,
-there is no need for expensive explicit locking, synchronisation,
+there is no need for expensive explicit locking, synchronization,
 or atomic operations to deal with most concurrency problems.
 \item
 Each task has exclusive access to the data it is working on,
 thus improving cache locality and efficiency.
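
The advantage touched by this hunk, that correctly specified dependencies and conflicts remove the need for explicit locking, synchronization, or atomics, is the core of the scheme. As a rough analogue only (SWIFT is built on the task-based library referred to as \qs in the paper, not on OpenMP tasking), the C sketch below uses OpenMP task dependencies to illustrate the idea: each task declares the data it reads and writes, and the runtime orders the tasks accordingly.

    #include <stdio.h>

    #define N 4

    /* Toy per-cell fields standing in for the SPH data. */
    static double density[N];
    static double force[N];

    int main(void) {
      #pragma omp parallel
      #pragma omp single
      {
        for (int i = 0; i < N; i++) {
          /* Density task: declares that it writes density[i]. */
          #pragma omp task depend(out: density[i])
          density[i] = 1.0 + i;

          /* Force task: reads density[i] and writes force[i]; the
           * runtime orders it after the matching density task, with
           * no explicit locking or atomics. */
          #pragma omp task depend(in: density[i]) depend(out: force[i])
          force[i] = 2.0 * density[i];
        }
      } /* implicit barrier: all tasks have completed here */

      for (int i = 0; i < N; i++)
        printf("cell %d: density = %.1f, force = %.1f\n",
               i, density[i], force[i]);
      return 0;
    }

Conflicts, i.e. tasks that may run in either order but never concurrently on the same data, are the other half of such a specification; in this OpenMP analogue the closest match is the mutexinoutset dependence type.
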
@@ -388,7 +388,7 @@ cores of a shared-memory machine \cite{gonnet2015efficient}.
 \caption{Task hierarchy for the SPH computations in \swift,
 including communication tasks. Arrows indicate dependencies,
 i.e.~an arrow from task $A$ to task $B$ indicates that $A$
-depends on $B$. The task colour corresponds to the cell or
+depends on $B$. The task color corresponds to the cell or
 cells it operates on, e.g.~the density and force tasks work
 on individual cells or pairs of cells.
 The blue cell data is on a separate rank as the yellow and
@@ -485,7 +485,7 @@ once computation has finished, and so on.
 This approach, although conceptually simple and easy to implement,
 has three major drawbacks:
 \begin{itemize}
-\item The frequent synchronisation points between communication
+\item The frequent synchronization points between communication
 and computation exacerbate load imbalances,
 \item the communication phase consists mainly of waiting on
 latencies, during which the node's CPUs usually run idle, and
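
The drawbacks listed in this hunk refer to the conventional compute/exchange cycle, in which every rank stops at fixed points to trade boundary data before it may continue. The C fragment below is a generic sketch of that bulk-synchronous pattern, not code from \swift or from any particular simulation code; the kernel names and halo buffers are placeholders, and it is included only to make the enforced waiting explicit.

    #include <mpi.h>

    /* Placeholder kernels standing in for the real SPH loops. */
    static void compute_local_densities(void) { /* ... */ }
    static void compute_forces(void) { /* ... */ }

    /* One bulk-synchronous step: compute, then stop and exchange
     * boundary data with the neighboring ranks, then wait for everyone
     * before computing again.  The exchange is dominated by latency,
     * the CPUs sit idle during it, and the slowest rank sets the pace
     * for all the others. */
    void timestep_bulk_synchronous(double *send_halo, double *recv_halo,
                                   int n, int left, int right) {
      compute_local_densities();

      /* Blocking boundary exchange with the two neighboring ranks. */
      MPI_Sendrecv(send_halo, n, MPI_DOUBLE, right, 0,
                   recv_halo, n, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* Explicit synchronization point between the two phases. */
      MPI_Barrier(MPI_COMM_WORLD);

      compute_forces();
    }

Replacing the blocking exchange and the barrier with the asynchronous send and receive tasks sketched earlier is what removes these synchronization points.
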
@@ -613,7 +613,7 @@ the 16 node run.
 \centering
 \includegraphics[width=\columnwidth]{Figures/domains}
 \caption{The particles for the initial conditions shown on Fig.~\ref{fig:ICs}
-coloured according to the node they belong to after a load-balancing call on
+colored according to the node they belong to after a load-balancing call on
 32 nodes. As can be seen, the domain decomposition follows the cells in the mesh
 but is not made of regular cuts. Domains have different shapes and
 sizes.
 \label{fig:domains}}
@@ -641,7 +641,7 @@ the 16 node run.
 For our next test, we ran \swift on the SuperMUC x86 phase~1 thin
 nodes\footnote{\url{https://www.lrz.de/services/compute/supermuc/systemdescription/}}
-located at the Leibniz Supercomputing Centre in Garching near Munich. This
+located at the Leibniz Supercomputing Center in Garching near Munich. This
 system consists of 9\,216 nodes with 2 Intel Sandy Bridge-EP Xeon E5-2680
 8C\footnote{\url{http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-(20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI)}}
 at $2.7~\rm{GHz}$ CPUS. Each 16-core node has $32~\rm{GByte}$ of RAM.
@@ -677,11 +677,11 @@ threads per node, i.e.~one thread per physical core.
 \end{figure*}
-\subsection{BlueGene architecture: JUQUEEN}
+\subsection{Blue Gene architecture: JUQUEEN}
-For our last set of tests, we ran \swift on the JUQUEEN IBM BlueGene/Q
+For our last set of tests, we ran \swift on the JUQUEEN IBM Blue Gene/Q
 system\footnote{\url{http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/Configuration/Configuration_node.html}}
-located at the J\"ulich Supercomputing Centre. This system consists of
+located at the J\"ulich Supercomputing Center. This system consists of
 28\,672 nodes with an IBM PowerPC A2 processor running at $1.6~\rm{GHz}$
 and $16~\rm{GByte}$ of RAM each. Of notable interest
 is the presence of two floating units per compute core. The system is
@@ -711,7 +711,7 @@ test are shown in Fig.~\ref{fig:JUQUEEN2}.
 \centering
 \includegraphics[width=\columnwidth]{Figures/scalingInNode}
 \caption{Strong scaling test of the hybrid component of the code. The
-same calculation is performed on 512 node of the JUQUEEN BlueGene
+same calculation is performed on 512 node of the JUQUEEN Blue Gene
 supercomputer (see text for hardware description) using a single MPI
 rank per node and varying only the number of
 threads per node. The code displays excellent scaling even when all the cores and
@@ -723,7 +723,7 @@ test are shown in Fig.~\ref{fig:JUQUEEN2}.
 \begin{figure*}
 \centering
 \includegraphics[width=\textwidth]{Figures/scalingBlueGene}
-\caption{Strong scaling test on the JUQUEEN BlueGene machine (see text
+\caption{Strong scaling test on the JUQUEEN Blue Gene machine (see text
 for hardware description).
 \textit{Left panel:} Code Speed-up.
 \textit{Right panel:} Corresponding parallel efficiency.
 Using 32 threads per node (2 per physical core) with one MPI rank
@@ -750,14 +750,14 @@ machines, thanks to the use of task-based parallelism at the node level, and on
 the largest machines (Tier-0 systems) currently available, thanks to the
 task-based domain distribution and asynchronous communication schemes.
 We would like to emphasize that these results were obtained for a
-realistic test case without any micro-level optimisation or explicit
-vectorisation.
+realistic test case without any micro-level optimization or explicit
+vectorization.
 Excellent strong scaling is also achieved when increasing the number of threads
 per node (i.e.~per MPI rank, see fig.~\ref{fig:JUQUEEN1}), demonstrating that
 the description of MPI (asynchronous) communications as tasks within our
 framework is not a bottleneck. One common conception in HPC is that the number
-of MPI communications between nodes should be kept to a minimum to optimise the
+of MPI communications between nodes should be kept to a minimum to optimize the
 efficiency of the calculation. Our approach does exactly the opposite with large
 number of point-to-point communications between pairs of nodes occurring over the
 course of a time-step. For example, on the SuperMUC machine with 32 nodes (512
@@ -787,7 +787,7 @@ The excellent scaling
 performance of \swift allows us to push this number further by simply increasing
 the number of cores, whilst \gadget reaches its peak speed (for this problem) at
 around 300 cores and stops scaling beyond that. This unprecedented scaling
-ability combined with future work on vectorisation of the calculations within
+ability combined with future work on vectorization of the calculations within
 each task will hopefully make \swift an important tool for future simulations in
 cosmology and help push the entire field to a new level.
@@ -802,7 +802,7 @@ This work would not have been possible without Lydia Heck's help and
 expertise. We acknowledge the help of Tom Theuns, James Willis, Bert
 Vandenbroucke, Angus Lepper and other contributors to the \swift code.
 \\
 We thank Heinrich Bockhorst and Stephen Blair-Chappell from {\sc
-intel} as well as Dirk Brommel from the J\"ulich Computing Centre
+intel} as well as Dirk Brommel from the J\"ulich Computing Center
 and Nikolay J. Hammer from the Leibniz Rechnenzentrum for their help
 at various stages of this project.
 \\
 This work used the DiRAC Data
 Centric system at Durham University, operated by the Institute for