SWIFT / SWIFTsim · Commits
Commit 189416cb, authored 12 years ago by Pedro Gonnet

    latest modifications to paper.

    Former-commit-id: e53686ef167cd1dc61e4fa4a50fff34963ea51e8

parent 0c82d783
Showing 1 changed file: theory/paper_algs/paper.tex (31 additions, 14 deletions)
@@ -99,7 +99,7 @@ A new framework for the parallelization of Smoothed Particle Hydrodynamics (SPH)
 simulations on shared-memory parallel architectures is described.
 This framework relies on fast and cache-efficient cell-based neighbour-finding
 algorithms, as well as task-based parallelism to achieve good scaling and
-parallel efficiency on mult-core computers.
+parallel efficiency on multi-core computers.
 \end{abstract}
@@ -497,7 +497,7 @@ This reduces the \oh{n\log{n}} sorting to \oh{n} for merging.
 The arguably most well-known paradigm for shared-memory,
 or thread-based parallelism, is OpenMP, in which
 compiler annotations are used to describe if and when
-specific loops or portions of the code can be execuded
+specific loops or portions of the code can be executed
 in parallel.
 When such a parallel section, e.g.~a parallel loop, is
 encountered, the sections of the loop are split statically
@@ -517,7 +517,7 @@ is inherently parallelisable.
 One such approach is {\em task-based parallelism}, in which the
 computation is divided into a number of inter-dependent
 computational tasks, which are then scheduled, concurrently
-and aysnchronously, to a number of processors.
+and asynchronously, to a number of processors.
 In order to ensure that the tasks are executed in the right
 order, e.g.~that data needed by one task is only used once it
 has been produced by another task, and that no two tasks
@@ -564,7 +564,7 @@ for a given cell, and, in turn, all force computations involving
 that cell depend on its ghost task.
 Using this mechanism, we can enforce that all density computations
 for a set of particles have completed before we use this
-density in the force computaitons.
+density in the force computations.
 The dependencies and conflicts between tasks are then given as follows:
@@ -634,7 +634,7 @@ The dependencies and conflicts between tasks are then given as follows:
 If the dependencies and conflicts are defined correctly, then
 there is no risk of concurrency problems and thus each task
 can be implemented without special attention to the latter,
-e.g.~it can update data without using exclusinve access barriers
+e.g.~it can update data without using exclusive access barriers
 or atomic memory updates.
 This, however, requires some care in how the individual tasks
 are allocated to the computing threads, i.e.~each task should
@@ -660,7 +660,7 @@ in the queue.
 The {\tt pthread\_mutex\_t lock} is used to guarantee exclusive access
 to the queue.
-Task IDs are retreived from the queue as follows:
+Task IDs are retrieved from the queue as follows:
 \begin{center}\begin{minipage}{0.8\textwidth}
 \begin{lstlisting}
@@ -699,11 +699,11 @@ The lock on the queue is then released (line~12) and
 the task ID, or {\tt -1} if no available task was found, is
 returned.
-The advantage of swapping the retreived task to the next
+The advantage of swapping the retrieved task to the next
 position in the list is that if the queue is reset, e.g.~{\tt next}
 is set to zero, and used again with the same set of tasks,
 they will now be traversed in the order in which they were
-exectuted in the previous run.
+executed in the previous run.
 This provides a basic form of iterative refinement of the task
 order.
 The tasks can also be sorted topologically, according to their
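The retrieval scheme this hunk describes (lock the queue, scan from {\tt next} for a runnable task, swap it into the {\tt next} slot so a reset queue replays tasks in executed order) can be sketched in C. This is a minimal illustrative sketch, not the paper's actual listing: the struct layout, the `queue_gettask` signature, and the toy `task_is_runnable` rule (even IDs are runnable) are all assumptions for demonstration.

```c
#include <pthread.h>

#define QUEUE_SIZE 64

/* Hypothetical queue layout; the real code differs. */
struct queue {
    pthread_mutex_t lock;  /* guarantees exclusive access to the queue */
    int tid[QUEUE_SIZE];   /* task IDs in current traversal order */
    int count;             /* number of tasks in the queue */
    int next;              /* index of the next slot to hand out */
};

/* Toy stand-in for the real runnability check (dependencies, cell
   locks): here, even task IDs count as runnable. */
static int task_is_runnable(int tid) { return tid % 2 == 0; }

/* Return a runnable task ID, or -1 if none is available. */
int queue_gettask(struct queue *q) {
    int res = -1;
    pthread_mutex_lock(&q->lock);
    for (int k = q->next; k < q->count; k++) {
        if (task_is_runnable(q->tid[k])) {
            /* Swap the retrieved task into the 'next' position, so
               that resetting next to zero replays the tasks in the
               order they were executed in the previous run. */
            int tmp = q->tid[k];
            q->tid[k] = q->tid[q->next];
            q->tid[q->next] = tmp;
            res = q->tid[q->next];
            q->next++;
            break;
        }
    }
    pthread_mutex_unlock(&q->lock);
    return res;
}
```

With tasks {1, 2, 3, 4} under the toy rule, successive calls hand out 2 and then 4, leaving them in the first two slots for the next pass.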
@@ -718,14 +718,14 @@ a large number of threads.
 One way of avoiding this problem is to use several concurrent
 queues, e.g.~one queue per thread, and spread the tasks over
 all queues.
-A fixed assignemnt of tasks to queues can, however,
+A fixed assignment of tasks to queues can, however,
 cause load balancing problems, e.g.~when a thread's queue is
 empty before the others have finished.
 In order to avoid such problems, {\em work-stealing} can be used:
 If a thread cannot obtain a task from its own queue, it picks
 another queue at random and tries to {\em steal} a task from it
 i.e. if it can obtain a task, it removes it from the queue and
-adds it to it's own queue, thus iteratively rebalancing
+adds it to it's own queue, thus iteratively re-balancing
 the task queues if they are used repeatedly:
 \begin{center}\begin{minipage}{0.8\textwidth}
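The work-stealing step in this hunk can be sketched as follows: try the thread's own queue first, and on failure probe the other queues (starting at a random one), moving a stolen task into the thief's own queue so repeated runs re-balance the load. All names and the single-threaded queue representation are illustrative assumptions; the paper's queues are mutex-protected, which is omitted here for brevity.

```c
#include <stdlib.h>

#define NQUEUES 4
#define QCAP 16

/* Toy per-thread queues: tasks tid[next..count-1] are not yet handed out. */
struct tqueue { int tid[QCAP]; int count, next; };
static struct tqueue queues[NQUEUES];

/* Hand out the next task from one queue, or -1 if it is exhausted. */
static int queue_get(struct tqueue *q) {
    return q->next < q->count ? q->tid[q->next++] : -1;
}

/* Get a task for thread 'self', stealing from another queue if its
   own queue is empty.  (A real version would lock each queue.) */
int gettask_steal(int self) {
    int tid = queue_get(&queues[self]);
    if (tid >= 0) return tid;
    /* Probe the other queues, starting at a random victim. */
    int start = rand() % NQUEUES;
    for (int i = 0; i < NQUEUES && tid < 0; i++) {
        int v = (start + i) % NQUEUES;
        struct tqueue *victim = &queues[v];
        if (v == self || victim->next >= victim->count) continue;
        /* Remove the victim's last pending task and append it to our
           own queue, then hand it out from there: on the next run it
           will already sit in this thread's queue. */
        int t = victim->tid[--victim->count];
        queues[self].tid[queues[self].count++] = t;
        tid = queue_get(&queues[self]);
    }
    return tid;
}
```

Because the stolen task is appended to the thief's queue before being handed out, resetting the queues and re-running traverses a better-balanced assignment each time, as the hunk describes.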
@@ -821,7 +821,7 @@ void cell_unlocktree ( struct cell c ) {
 are ``locked'' while the cells marked in yellow have a ``hold'' count
 larger than zero.
 The hold count is shown inside each cell and corresponds to the number
-of locked cells hierarchicaly below it.
+of locked cells hierarchically below it.
 All cells except for those locked or with a ``hold'' count larger than
 zero can still be locked without causing concurrent data access.
 }
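The hold-count scheme in this hunk's caption can be sketched in C: a cell may be locked only if it is not itself locked, has no locked cells below it (hold count zero), and has no locked ancestor; locking then increments the hold count of every ancestor. This is a single-threaded illustrative sketch with assumed names, not the paper's `cell_locktree`; the real version would need atomic updates and back-off to handle races while climbing the tree.

```c
#include <stddef.h>

/* Hypothetical cell layout for illustration. */
struct cell {
    struct cell *parent;  /* NULL at the root */
    int lock;             /* 1 while a task holds this cell's subtree */
    int hold;             /* number of locked cells hierarchically below */
};

/* Try to lock cell c; returns 0 on success, 1 on failure. */
int cell_locktree(struct cell *c) {
    /* A cell that is locked, or that has locked cells below it,
       cannot be locked again. */
    if (c->lock || c->hold) return 1;
    /* Fail if any ancestor is locked: it owns this whole subtree. */
    for (struct cell *p = c->parent; p != NULL; p = p->parent)
        if (p->lock) return 1;
    /* Lock the cell and bump the hold count of every ancestor, so
       that no ancestor can be locked while we hold this cell. */
    c->lock = 1;
    for (struct cell *p = c->parent; p != NULL; p = p->parent)
        p->hold += 1;
    return 0;
}

/* Release the lock and the hold counts taken by cell_locktree. */
void cell_unlocktree(struct cell *c) {
    c->lock = 0;
    for (struct cell *p = c->parent; p != NULL; p = p->parent)
        p->hold -= 1;
}
```

This matches the figure's invariant: locked cells and cells with a nonzero hold count are off-limits, while every other cell can still be locked without concurrent data access.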
@@ -871,13 +871,30 @@ void cell_unlocktree ( struct cell c ) {
 \begin{itemize}
-\item Scaling for both simulations on different parallel hardware.
+\item Results for a 1.8M particle simulation on a 32-core Intel Xeon X7550
+    are shown in \fig{Results}.
-\item Compare, if possible, with {\sc gadget}.
+\item The new simulation code not only scales much better, e.g. achieving
+    a parallel efficiency of 63\% at 32 cores.
 \end{itemize}
+
+\begin{figure}[ht]
+\centerline{\epsfig{file=figures/scaling.pdf,width=0.9\textwidth}}
+\caption{Parallel scaling and efficiency for Gadget-2 and GadgetSMP
+    for a 1.8M particle simulation.
+    The numbers in the scaling plot are the average number of miliseconds
+    per simulation time step.
+    Note that not only does GadgetSMP scale better, it is also up to nine
+    times faster.
+    The timings for Gadget-2 are courtesy of Matthieu Schaller of the
+    Institute of Computational Cosmology at Durham University.}
+\label{fig:Results}
+\end{figure}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 % Conclusions
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%