Commit e96153f8 authored by Pedro Gonnet

wrote-up intro to section 3.

parent 3cfdc638
2 merge requests: !136 Master, !80 PASC paper
@@ -273,8 +273,23 @@ analysis).
\section{Parallelisation strategy}

One of the main concerns when developing \swift was to break
with the branch-and-bound type parallelism inherent to parallel
codes using OpenMP and MPI, and the constant synchronisation
between computational steps that it entails.
If {\em synchronisation} is the main problem, then {\em asynchronicity}
is the obvious solution.
We therefore opted for a {\em task-based} approach for maximum
single-node, or shared-memory, performance.
This approach not only provides excellent load balancing on a single
node, it also provides a powerful model of the computation that
can be used to partition the work equitably over a set of
distributed-memory nodes using general-purpose graph partitioning
algorithms.
Finally, the necessary communication between nodes can itself be
modelled in a task-based way, interleaving communication seamlessly
with the rest of the computation.
\subsection{Task-based parallelism}
@@ -501,16 +516,16 @@ One direct consequence of this approach is that instead of a single
{\tt send}/{\tt recv} call between each pair of neighbouring ranks,
one such pair is generated for each particle cell.
This type of communication, i.e.~several small messages instead of
one large message, is usually strongly discouraged since the sum of
the latencies for the small messages is usually much larger than
the latency of the single large message.
This, however, is not a concern since nobody is actually waiting
to receive the messages in order and the latencies are covered
by local computations.
A nice side-effect of this approach is that communication no longer
happens in bursts involving all the ranks at the same time, but
is more or less evenly spread over the entire computation, and is
therefore less demanding of the communication infrastructure.
@@ -547,8 +562,9 @@ removed the first and last ones, where i/o occurs.
almost $1000$ across the simulation volume. \label{fig:ICs}}
\end{figure}
On all the machines, the code was compiled out of the box,
without any tuning, explicit vectorization, or exploiting any
other specific features of the underlying hardware.
\subsection{x86 architecture: Cosma-5}