Commit e371464c authored by Matthieu Schaller

Corrected typos

parent 1a35eda8
2 merge requests: !136 Master, !80 PASC paper
@@ -206,7 +206,7 @@ The particle density $\rho_i$ used in \eqn{interp} is itself computed similarly:
where $r_{ij} = \|\mathbf{r_i}-\mathbf{r_j}\|$ is the Euclidean distance between
particles $p_i$ and $p_j$. In compressible simulations, the smoothing length
$h_i$ of each particle is chosen such that the number of neighbours with which
it interacts is kept more or less constant, and can result in smoothing lengths
spanning several orders of magnitude within the same simulation.

Once the densities $\rho_i$ have been computed, the time derivatives of the
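For reference, the density estimate referred to here is the standard SPH kernel-weighted sum over neighbouring particles; its usual form is reproduced below (the exact notation of \eqn{interp} and of the kernel normalisation in the paper may differ slightly):
\begin{equation}
  \rho_i = \sum_j m_j \, W(r_{ij}, h_i),
\end{equation}
where $m_j$ is the mass of particle $p_j$ and $W$ is the smoothing kernel, whose support is of order the smoothing length $h_i$.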
@@ -263,7 +263,7 @@ particles and searching for their neighbours in the tree.
Although such tree traversals are trivial to parallelise, they
have several disadvantages, e.g.~with regard to computational
efficiency, cache efficiency, and exploiting symmetries in the
computation (see \cite{gonnet2015efficient} for a more detailed
analysis).
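To make the contrast with the task-based approach concrete, a minimal sketch of the classical per-particle tree walk alluded to above is given below. The node layout and function names are illustrative only and are not taken from the code discussed in the paper.
\begin{verbatim}
#include <math.h>

/* Illustrative octree node for a per-particle neighbour search. */
struct node {
  double centre[3], half_width;   /* cubic cell bounds */
  struct node *children[8];       /* all NULL if this is a leaf */
  int *parts, count;              /* particle indices stored in a leaf */
};

/* Recursively collect the indices of all particles in cells that may lie
 * within a distance h of position x (an exact distance check on the
 * returned candidates is still needed afterwards). */
void tree_walk(const struct node *n, const double x[3], double h,
               int *neighbours, int *nr_neighbours) {
  /* Prune sub-trees whose bounding cube lies entirely outside the radius. */
  double d2 = 0.0;
  for (int k = 0; k < 3; k++) {
    const double d = fabs(x[k] - n->centre[k]) - n->half_width;
    if (d > 0.0) d2 += d * d;
  }
  if (d2 > h * h) return;

  if (n->children[0] == NULL) {   /* leaf: record its particles */
    for (int i = 0; i < n->count; i++)
      neighbours[(*nr_neighbours)++] = n->parts[i];
  } else {                        /* interior node: recurse into children */
    for (int c = 0; c < 8; c++)
      if (n->children[c])
        tree_walk(n->children[c], x, h, neighbours, nr_neighbours);
  }
}
\end{verbatim}
Because such a walk is repeated independently for every particle, nearby particles traverse largely the same nodes, which is one source of the inefficiencies mentioned above.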
@@ -291,7 +291,7 @@ The main advantages of using a task-based approach are
\item The order in which the tasks are processed is completely
dynamic and adapts automatically to load imbalances.
\item If the dependencies and conflicts are specified correctly,
there is no need for expensive explicit locking, synchronisation,
or atomic operations to deal with most concurrency problems.
\item Each task has exclusive access to the data it is working on,
thus improving cache locality and efficiency.
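As a rough illustration of how dependencies and conflicts can be realised, the sketch below uses a per-task counter of unresolved dependencies and per-cell try-locks; the locking and atomics live in this small scheduling layer, so the task bodies themselves run without explicit synchronisation on the data they own. This is a generic pattern with illustrative names, not the specific data structures of the code described in the paper.
\begin{verbatim}
#include <stdatomic.h>

struct cell { atomic_flag lock; /* ...particle data... */ };

/* Generic task descriptor: a count of unsatisfied dependencies and the
 * tasks that become runnable once this one has completed. */
struct task {
  atomic_int wait;           /* number of tasks that must finish first */
  struct task **unlocks;     /* tasks that depend on this one */
  int nr_unlocks;
  struct cell *ci, *cj;      /* cells whose data this task touches */
  void (*run)(struct task *);
};

/* Conflicts: a task may only start if it can acquire the locks of all
 * the cells it touches, giving it exclusive access to that data. */
static int task_try_lock(struct task *t) {
  if (atomic_flag_test_and_set(&t->ci->lock)) return 0;
  if (t->cj && atomic_flag_test_and_set(&t->cj->lock)) {
    atomic_flag_clear(&t->ci->lock);
    return 0;
  }
  return 1;
}

/* Dependencies: when a task finishes, release its cells and decrement the
 * wait counters of the tasks it unlocks; any counter reaching zero makes
 * that task available to the next idle thread. */
static void task_done(struct task *t, void (*enqueue)(struct task *)) {
  atomic_flag_clear(&t->ci->lock);
  if (t->cj) atomic_flag_clear(&t->cj->lock);
  for (int k = 0; k < t->nr_unlocks; k++)
    if (atomic_fetch_sub(&t->unlocks[k]->wait, 1) == 1)
      enqueue(t->unlocks[k]);
}
\end{verbatim}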
@@ -342,7 +342,7 @@ neighbour search on a single core, and scales efficiently to all
cores of a shared-memory machine \cite{gonnet2015efficient}.

\subsection{Task-based domain decomposition}

Given a task-based description of a computation, partitioning it over
a fixed number of nodes is relatively straightforward: we create
@@ -364,7 +364,7 @@ Any task spanning cells that belong to the same partition needs only
to be evaluated on that rank/partition, and tasks spanning more than
one partition need to be evaluated on both ranks/partitions.
If we then weight each edge with the computational cost associated with
each task, then finding a {\em good} partitioning reduces to finding a
partition of the cell graph such that:
\begin{itemize}
@@ -384,7 +384,7 @@ the optimal partition for more than two nodes is considered NP-hard.},
e.g.~METIS \cite{ref:Karypis1998} and Zoltan \cite{devine2002zoltan},
exist.
Note that this approach does not explicitly consider any geometric
constraints, or strive to partition the {\em amount} of data equitably.
The only criterion is the computational cost of each partition, for
which the task decomposition provides a convenient model.
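For concreteness, the sketch below shows how such a weighted cell graph could be handed to a graph partitioner. It uses the METIS~5 C interface (the runs reported later link against METIS 4.0.2, whose call signature differs) and illustrative variable names:
\begin{verbatim}
#include <metis.h>

/* Partition the cell graph over nr_nodes ranks.  xadj/adjncy is the CSR
 * adjacency of the graph, vwgt holds the per-cell task costs and adjwgt
 * the cost of the pair-tasks spanning each edge. */
int partition_cells(idx_t nr_cells, idx_t *xadj, idx_t *adjncy,
                    idx_t *vwgt, idx_t *adjwgt, idx_t nr_nodes,
                    idx_t *celllist /* out: rank assigned to each cell */) {
  idx_t ncon = 1;  /* a single balance constraint: computational cost */
  idx_t objval;    /* total weight of the edges cut by the partition */
  return METIS_PartGraphKway(&nr_cells, &ncon, xadj, adjncy, vwgt,
                             /*vsize=*/NULL, adjwgt, &nr_nodes,
                             /*tpwgts=*/NULL, /*ubvec=*/NULL,
                             /*options=*/NULL, &objval, celllist);
}
\end{verbatim}
Balancing the vertex weights keeps the computational cost of each partition roughly equal, while minimising the edge cut (objval) minimises the work and data exchange duplicated across ranks.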
@@ -418,12 +418,12 @@ large suite of state-of-the-art cosmological simulations. By selecting outputs
at late times, we constructed a simulation setup which is representative of the
most expensive part of these simulations, i.e. when the particles are
highly clustered and no longer uniformly distributed. This distribution of
particles is shown in Fig.~\ref{fig:ICs} and periodic boundary conditions are
used. In order to fit our simulation setup into the limited memory of some of
the systems tested, we randomly down-sampled the particle count of the
output to $800^3=5.12\times10^8$, $600^3=2.16\times10^8$ and
$300^3=2.7\times10^7$ particles, respectively. We then ran the \swift code for
100 time-steps and averaged the wall-clock time of these time-steps after having
removed the first and last ones, where I/O occurs.

\begin{figure}
@@ -431,8 +431,8 @@ removed the first and last ones, where i/o occurs.
\includegraphics[width=\columnwidth]{Figures/cosmoVolume}
\caption{The initial density field computed from the initial particle
distribution used for our tests. The density $\rho_i$ of the particles spans 8
orders of magnitude, requiring smoothing lengths $h_i$ changing by a factor of
almost $1000$ across the simulation volume. \label{fig:ICs}}
\end{figure}
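As an editorial sanity check on the two figures quoted in this caption (not part of the original text): for particles of roughly equal mass the smoothing length follows the standard SPH scaling
\begin{equation}
  h_i \propto \left(\frac{m_i}{\rho_i}\right)^{1/3},
\end{equation}
so a density range of $8$ orders of magnitude translates into $h_i$ varying by roughly $10^{8/3} \approx 500$, i.e.~of the same order as the quoted factor of almost $1000$.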
@@ -491,7 +491,7 @@ threads per node (i.e. one thread per physical core).
For our last set of tests, we ran \swift on the JUQUEEN IBM BlueGene/Q
system\footnote{\url{http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/Configuration/Configuration_node.html}}
located at the J\"ulich Supercomputing Centre. This system is made of 28,672
nodes, each consisting of an IBM PowerPC A2 processor running at $1.6~\rm{GHz}$
with $16~\rm{GByte}$ of RAM. Of notable interest is the presence of two
floating-point units per compute core. The system is composed of 28 racks, each
containing 1,024 nodes. The network uses a 5D torus to link all the racks.
@@ -500,7 +500,7 @@ The code was compiled with the IBM XL compiler version \textsc{30.73.0.13} and
linked to the corresponding MPI library and the METIS library
version \textsc{4.0.2}.

The simulation setup with $600^3$ particles was first run on that system using
512 nodes with one MPI rank per node and a variable number of threads per
node. The results of this test are shown in Fig.~\ref{fig:JUQUEEN1}.
@@ -547,15 +547,15 @@ test are shown on Fig.~\ref{fig:JUQUEEN2}.
\section{Conclusions}

When running on the SuperMUC machine with 32 nodes (512 cores), each MPI rank
contains approximately $1.6\times10^7$ particles in $2.5\times10^5$
cells. \swift will generate around $58,000$ point-to-point asynchronous MPI
communications (a pair of \texttt{Isend} and \texttt{Irecv}) per node every
time-step.
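To illustrate what one of these exchanges amounts to, the sketch below posts a non-blocking send/receive pair for a cell's particle buffer and polls it for completion so that the tasks depending on the received data can be released. Function and parameter names are illustrative and are not the actual routines of the code.
\begin{verbatim}
#include <mpi.h>

/* Post the non-blocking exchange of one cell's particle buffer with a
 * neighbouring rank (one Isend/Irecv pair, as counted in the text). */
void post_cell_exchange(void *send_buf, int send_count,
                        void *recv_buf, int recv_count,
                        int other_rank, int tag, MPI_Request req[2]) {
  MPI_Isend(send_buf, send_count, MPI_BYTE, other_rank, tag,
            MPI_COMM_WORLD, &req[0]);
  MPI_Irecv(recv_buf, recv_count, MPI_BYTE, other_rank, tag,
            MPI_COMM_WORLD, &req[1]);
}

/* Polled repeatedly from the scheduler: returns 1 once both messages have
 * completed, at which point the dependent tasks can be unlocked. */
int exchange_done(MPI_Request req[2]) {
  int flag = 0;
  MPI_Testall(2, req, &flag, MPI_STATUSES_IGNORE);
  return flag;
}
\end{verbatim}
Because the communications are asynchronous, computation on locally available cells can proceed while the roughly $58,000$ such pairs per node and time-step are in flight.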
%#####################################################################################################

\section{Acknowledgements}

This work would not have been possible without Lydia Heck's help and
expertise. We thank Heinrich Bockhorst and Stephen Blair-Chappell from
{\sc intel} as well as Dirk Brommel from the J\"ulich Computing Centre
...