Commit e96153f8, authored 9 years ago by Pedro Gonnet

wrote-up intro to section 3.

parent 3cfdc638
2 merge requests: !136 Master, !80 PASC paper

Showing 1 changed file: theory/paper_pasc/pasc_paper.tex (+25 −9)
@@ -273,8 +273,23 @@ analysis).

\section{Parallelisation strategy}

One of the main concerns when developing \swift{} was to break with the
branch-and-bound type parallelism inherent to parallel codes using
OpenMP and MPI, and the constant synchronisation between computational
steps it results in.
If {\em synchronisation} is the main problem, then {\em asynchronicity}
is the obvious solution.
We therefore opted for a {\em task-based} approach for maximum
single-node, or shared-memory, performance.
This approach not only provides excellent load balancing on a single
node, but also yields a powerful model of the computation that can be
used to partition the work equitably over a set of distributed-memory
nodes using general-purpose graph partitioning algorithms.
Finally, the necessary communication between nodes can itself be
modelled in a task-based way, interleaving communication seamlessly
with the rest of the computation.
\subsection{Task-based parallelism}
@@ -501,16 +516,16 @@ One direct consequence of this approach is that instead of a single

{\tt send}/{\tt recv} call between each pair of neighbouring ranks,
one such pair is generated for each particle cell.
This type of communication, i.e.~several small messages instead of
one large message, is usually strongly discouraged since the sum of
the latencies of the small messages is typically much larger than the
latency of the single large message.
This, however, is not a concern here, since nobody is actually waiting
to receive the messages in any particular order, and the latencies are
covered by local computations.
A nice side-effect of this approach is that communication no longer
happens in bursts involving all the ranks at the same time, but is
more or less evenly spread over the entire computation, and is
therefore less demanding of the communication infrastructure.
@@ -547,8 +562,9 @@ removed the first and last ones, where i/o occurs.

almost $1000$ across the simulation volume.\label{fig:ICs}}
\end{figure}

On all the machines, the code was compiled out of the box, without any
tuning, explicit vectorization, or exploitation of any other specific
features of the underlying hardware.

\subsection{x86 architecture: Cosma-5}