Commit 5f7ed7f4 authored by Matthieu Schaller's avatar Matthieu Schaller

Merge branch 'repart-by-ticks-with-means' into 'master'

Repart by CPU ticks with optional fixed costs

See merge request !707
parents b7711e1f a9a55d00
......@@ -138,14 +138,15 @@ before you can build it.
=====================
- METIS:
a build of the METIS library can be optionally used to
optimize the load between MPI nodes (requires an MPI
library). This should be found in the standard installation
directories, or pointed at using the "--with-metis"
configuration option. In this case the top-level installation
directory of the METIS build should be given. Note to use
METIS you should supply at least "--with-metis".
- METIS/ParMETIS:
a build of the METIS or ParMETIS library should be used to
optimize the load between MPI nodes. This should be found in the
standard installation directories, or pointed at using the
"--with-metis" or "--with-parmetis" configuration options.
In this case the top-level installation directory of the build
should be given. Note that to use METIS or ParMETIS you should supply at
least "--with-metis". ParMETIS is preferred over METIS when there
is a choice.
- libNUMA:
a build of the NUMA library can be used to pin the threads to
......
......@@ -65,8 +65,8 @@ Parameters:
-T, --timers=<int> Print timers every time-step.
-v, --verbose=<int> Run in verbose mode, in MPI mode 2 outputs
from all ranks.
-y, --task-dumps=<int> Time-step frequency at which task graphs
are dumped.
-y, --task-dumps=<int> Time-step frequency at which task analysis
files and/or tasks are dumped.
-Y, --threadpool-dumps=<int> Time-step frequency at which threadpool
tasks are dumped.
......
......@@ -113,8 +113,8 @@ Parameters:
-T, --timers=<int> Print timers every time-step.
-v, --verbose=<int> Run in verbose mode, in MPI mode 2 outputs
from all ranks.
-y, --task-dumps=<int> Time-step frequency at which task graphs
are dumped.
-y, --task-dumps=<int> Time-step frequency at which task analysis
files and/or tasks are dumped.
-Y, --threadpool-dumps=<int> Time-step frequency at which threadpool
tasks are dumped.
......
......@@ -215,7 +215,7 @@ fi
# Check if task debugging is on.
AC_ARG_ENABLE([task-debugging],
[AS_HELP_STRING([--enable-task-debugging],
[Store task timing information and generate task dump files @<:@yes/no@:>@]
[Store extra information for generating task dump files @<:@yes/no@:>@]
)],
[enable_task_debugging="$enableval"],
[enable_task_debugging="no"]
......
......@@ -1976,7 +1976,7 @@ SEARCH_INCLUDES = YES
# preprocessor.
# This tag requires that the tag SEARCH_INCLUDES is set to YES.
INCLUDE_PATH =
INCLUDE_PATH = @top_srcdir@/src
# You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
# patterns (like *.h and *.hpp) to filter out the header-files in the
......
......@@ -60,7 +60,7 @@ can be found by typing ``./swift -h``::
-T, --timers=<int> Print timers every time-step.
-v, --verbose=<int> Run in verbose mode, in MPI mode 2 outputs
from all ranks.
-y, --task-dumps=<int> Time-step frequency at which task graphs
are dumped.
-y, --task-dumps=<int> Time-step frequency at which task analysis
files and/or tasks are dumped.
-Y, --threadpool-dumps=<int> Time-step frequency at which threadpool
tasks are dumped.
......@@ -41,9 +41,9 @@ FFTW
~~~~
Version 3.3.x or higher is required for periodic gravity.
METIS
~~~~~
METIS is used for domain decomposition and load balancing.
ParMETIS or METIS
~~~~~~~~~~~~~~~~~
One is required for domain decomposition and load balancing.
libNUMA
~~~~~~~
......
......@@ -26,9 +26,8 @@ system (i.e. over MPI on several nodes). Here are some recommendations:
+ Run with threads pinned. You can do this by passing the ``-a`` flag to the
SWIFT binary. This ensures that processes stay on the same core that spawned
them, ensuring that cache is accessed more efficiently.
+ Ensure that you compile with METIS. More information is available in an
upcoming paper, but using METIS allows for work to be distributed in a
more efficient way between your nodes.
+ Ensure that you compile with ParMETIS or METIS. These are required if you
want to load balance between MPI ranks.
Your batch script should look something like the following (to run on 16 nodes
each with 2x16 core processors for a total of 512 cores):
......
......@@ -15,4 +15,3 @@ parameter files.
parameter_description
output_selection
......@@ -458,7 +458,7 @@ using the parameter:
The default level of ``0`` implies no compression and values have to be in the
range :math:`[0-9]`. This integer is passed to the i/o library and used for the
lossless GZIP compression algorithm. Higher values imply higher compression but
loss-less GZIP compression algorithm. Higher values imply higher compression but
also more time spent deflating and inflating the data. Note that up until HDF5
1.10.x this option is not available when using the MPI-parallel version of the
i/o routines.
......@@ -500,7 +500,7 @@ Showing all the parameters for a basic hydro test-case, one would have:
int_time_label_on: 0
compression: 3
UnitLength_in_cgs: 1. # Use cm in outputs
UnitMass_in_cgs: 1. # Use grams in outpus
UnitMass_in_cgs: 1. # Use grams in outputs
UnitVelocity_in_cgs: 1. # Use cm/s in outputs
UnitCurrent_in_cgs: 1. # Use Ampere in outputs
UnitTemp_in_cgs: 1. # Use Kelvin in outputs
......@@ -611,8 +611,212 @@ Scheduler
.. _Parameters_domain_decomposition:
Domain Decomposition
--------------------
Domain Decomposition:
---------------------
This section determines how the top-level cells are distributed between the
ranks of an MPI run. An ideal decomposition should result in each rank having
a similar amount of work to do, so that all the ranks complete at the same
time. Achieving a good balance requires that SWIFT is compiled with either the
ParMETIS or METIS libraries. ParMETIS is an MPI version of METIS, so is
preferred for performance reasons.
When we use ParMETIS/METIS the top-level cells of the volume are considered as
a graph, with a cell at each vertex and edges that connect the vertices to all
the neighbouring cells (so we have 26 edges connected to each vertex).
Decomposing such a graph into domains is known as partitioning, so in SWIFT we
refer to domain decomposition as partitioning.
This graph of cells can have weights associated with the vertices and the
edges. These weights are then used to guide the partitioning, seeking to
balance the total weight of the vertices and minimize the weights of the edges
that are cut by the domain boundaries (known as the edgecut). We can consider
the edge weights as a proxy for the exchange of data between cells, so
minimizing this reduces communication.
The Initial Partition:
^^^^^^^^^^^^^^^^^^^^^^
When SWIFT first starts it reads the initial conditions and then does an
initial distribution of the top-level cells. At this time the only information
available is the cell structure and, by geometry, the particles each cell
should contain. The type of partitioning attempted is controlled by the::
DomainDecomposition:
initial_type:
parameter, which can have the values *memory*, *region*, *grid* or
*vectorized*:
* *memory*
This is the default if METIS or ParMETIS is available. It performs a
partition based on the memory use of all the particles in each cell,
attempting to equalize the memory used by all the ranks.
How successful this attempt is depends on the granularity of cells and
particles and on the number of ranks; clearly, if most of the particles are in
one cell, or in a small region of the volume, balance is difficult or
impossible. Having more top-level cells makes it easier to calculate a
good distribution (but this comes at the cost of greater overheads).
* *region*
The only other METIS/ParMETIS option is "region". This attempts to assign equal
numbers of cells to each rank, with the surface area of the regions minimised
(so we get blobs, rather than rectangular volumes of cells).
If ParMETIS and METIS are not available, two other options are possible, but
they will give a poorer partition:
* *grid*
Split the cells into a number of axis-aligned regions. The number of
splits per axis is controlled by the::
initial_grid
parameter, which takes an array of three values. The product of these values
must equal the number of MPI ranks. If not set, a suitable default will be used.
* *vectorized*
Allocate the cells on the basis of proximity to a set of seed
positions. The seed positions are picked every nranks along a vectorized
cell list (a 1D representation). This is guaranteed to give an initial
partition in all cases when the number of cells is greater than or equal to
the number of MPI ranks, so it can be used if the others fail. Don't use this.
If ParMETIS and METIS are not available then only an initial partition will be
performed, so the balance will be limited by the quality of that initial
partition.
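
As an illustrative sketch (the rank count and grid values are arbitrary
choices, not recommendations), a grid-based initial partition for a run on 16
MPI ranks could be configured as::

  DomainDecomposition:
    initial_type: grid        # axis-aligned splits, no METIS/ParMETIS needed
    initial_grid: [4, 2, 2]   # product must equal the number of ranks (4*2*2 = 16)
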
Repartitioning:
^^^^^^^^^^^^^^^
When ParMETIS or METIS is available we can consider adjusting the balance
during the run, so we can improve from the initial partition and also track
changes in the run that require a different balance. The initial partition is
usually not optimal: although it may have balanced the distribution of
particles, it has not taken account of the fact that different particle types
require differing amounts of processing, nor of the work that requires
communication between cells. This latter point matters because we are running
an MPI job, where inter-cell communication may be very expensive.
There are a number of possible repartition strategies, which are defined using
the::
DomainDecomposition:
repartition_type:
parameter. The possible values for this are *none*, *fullcosts*, *edgecosts*,
*memory*, *timecosts*.
* *none*
Rather obviously, don't repartition. You are happy to run with the
initial partition.
* *fullcosts*
Use computation weights derived from the running tasks for the vertex and
edge weights. This is the default.
* *edgecosts*
Only use computation weights derived from the running tasks for the edge
weights.
* *memory*
Repeat the initial partition with the current particle positions,
re-balancing the memory use.
* *timecosts*
Only use computation weights derived from the running tasks for the vertex
weights and the expected time the particles will interact in the cells as
the edge weights. Using time as the edge weight has the effect of keeping
very active cells on single MPI ranks, so can reduce MPI communication.
The computation weights are actually the measured times, in CPU ticks, that
the tasks associated with a cell take, so these automatically reflect the
relative cost of the different task types (SPH, self-gravity etc.) and other
factors, such as how well they run on the current hardware and how well they
are optimized by the compiler used. However, this puts a constraint on how
often we can consider repartitioning, namely only when all (or nearly all) the
tasks of the system have been invoked in a step. To control this we have the::
minfrac: 0.9
parameter, which defines the minimum fraction of all the particles in the
simulation that must have been actively updated in the last step before
repartitioning is considered.
That then leaves the question of when a run is considered to be out of balance
and should benefit from a repartition. That is controlled by the::
trigger: 0.05
parameter. This value is the fractional CPU time difference between MPI
ranks; if the measured difference is smaller than this value a repartition
will not be done. Repartitioning can be expensive, not just in CPU time, but
also because large numbers of particles can be exchanged between MPI ranks, so
it is best avoided when not needed.
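
As a sketch, a typical time-based trigger setup, using the default values
quoted above, would look like::

  DomainDecomposition:
    repartition_type: fullcosts  # task times for both vertex and edge weights
    trigger:          0.05       # repartition if ranks differ by more than 5% in CPU time
    minfrac:          0.9        # only check once 90% of particles were updated in a step
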
If you are using ParMETIS there are additional ways that you can tune the
repartition process.
METIS only offers the ability to create a partition from a graph, which means
that each solution is independent of those that have already been made; this
can make the exchange of particles very large (although SWIFT attempts to
minimize it). ParMETIS, however, can use the existing partition to inform the
new one; it has two algorithms for this, which are selected using the::
adaptive: 1
parameter, where 1 means use adaptive repartition and 0 means simple
refinement. The adaptive algorithm is further controlled by the::
itr: 100
parameter, which defines the ratio of inter-node communication time to data
redistribution time, in the range 0.00001 to 10000000.0. Lower values give
less data movement during redistributions. The best choice for these can only
be determined by experimentation (the gains are usually small, so tuning is
not really recommended).
Finally we have the parameter::
usemetis: 0
which forces the use of the serial METIS API; this is probably only useful for
developers.
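
For illustration, these ParMETIS-specific options sit alongside the other
settings in the same section (the values shown are the documented defaults)::

  DomainDecomposition:
    adaptive: 1    # adaptive repartition rather than simple refinement
    itr:      100  # ratio of inter-node communication time to data redistribution time
    usemetis: 0    # set to 1 to force the serial METIS API
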
**Fixed cost repartitioning:**
So far we have assumed that repartitioning will only happen after a step that
meets the `minfrac:` and `trigger:` criteria, but we may want to repartition
at some arbitrary steps, and indeed do better than the initial partition
earlier in the run. This can be done using *fixed cost* repartitioning.
Fixed costs are output during each repartition step into the file
`partition_fixed_costs.h`; this should be created by a test run of your
full simulation (possibly with a smaller volume, but with all the physics
enabled). This file can then be used to replace the same file found in the
`src/` directory and SWIFT should then be recompiled. Once you have that, you
can use the parameter::
use_fixed_costs: 1
to control whether they are used or not. If enabled, these will be used to
repartition after the second step, which will generally give as good a
repartition immediately as you would otherwise get at the first unforced
repartition. Also, once these have been enabled you can change the `trigger:`
value to numbers greater than 2, and repartitioning will then be forced every
`trigger` steps. This latter option is probably only useful for developers,
but tuning the second step to use fixed costs can give some improvement.
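
As an example, a sketch of a configuration that forces a fixed-cost
repartition every 10 steps (the step count here is only illustrative) would
be::

  DomainDecomposition:
    use_fixed_costs: 1   # use the compiled-in fixed costs
    trigger:         10  # values > 2 force a repartition every `trigger` steps
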
.. _Parameters_structure_finding:
......
......@@ -290,10 +290,12 @@ int main(int argc, char *argv[]) {
#ifndef SWIFT_DEBUG_TASKS
if (dump_tasks) {
printf(
"Error: task dumping is only possible if SWIFT was configured"
" with the --enable-task-debugging option.\n");
return 1;
if (myrank == 0) {
message(
"WARNING: complete task dumps are only created when "
"configured with --enable-task-debugging.");
message(" Basic task statistics will be output.");
}
}
#endif
......@@ -503,8 +505,10 @@ int main(int argc, char *argv[]) {
message("Using METIS serial partitioning:");
else
message("Using ParMETIS partitioning:");
#else
#elif defined(HAVE_METIS)
message("Using METIS serial partitioning:");
#else
message("Non-METIS partitioning:");
#endif
message(" initial partitioning: %s",
initial_partition_name[initial_partition.type]);
......@@ -1040,99 +1044,17 @@ int main(int argc, char *argv[]) {
if (force_stop || (e.restart_onexit && e.step - 1 == nsteps))
engine_dump_restarts(&e, 0, 1);
#ifdef SWIFT_DEBUG_TASKS
/* Dump the task data using the given frequency. */
if (dump_tasks && (dump_tasks == 1 || j % dump_tasks == 1)) {
#ifdef WITH_MPI
/* Make sure output file is empty, only on one rank. */
char dumpfile[35];
snprintf(dumpfile, sizeof(dumpfile), "thread_info_MPI-step%d.dat", j + 1);
FILE *file_thread;
if (myrank == 0) {
file_thread = fopen(dumpfile, "w");
fclose(file_thread);
}
MPI_Barrier(MPI_COMM_WORLD);
for (int i = 0; i < nr_nodes; i++) {
/* Rank 0 decides the index of writing node, this happens one-by-one. */
int kk = i;
MPI_Bcast(&kk, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (i == myrank) {
/* Open file and position at end. */
file_thread = fopen(dumpfile, "a");
fprintf(file_thread,
" %03d 0 0 0 0 %lld %lld %lld %lld %lld 0 0 %lld\n", myrank,
e.tic_step, e.toc_step, e.updates, e.g_updates, e.s_updates,
cpufreq);
int count = 0;
for (int l = 0; l < e.sched.nr_tasks; l++) {
if (!e.sched.tasks[l].implicit && e.sched.tasks[l].toc != 0) {
fprintf(file_thread,
" %03i %i %i %i %i %lli %lli %i %i %i %i %lli %i\n",
myrank, e.sched.tasks[l].rid, e.sched.tasks[l].type,
e.sched.tasks[l].subtype, (e.sched.tasks[l].cj == NULL),
e.sched.tasks[l].tic, e.sched.tasks[l].toc,
(e.sched.tasks[l].ci != NULL)
? e.sched.tasks[l].ci->hydro.count
: 0,
(e.sched.tasks[l].cj != NULL)
? e.sched.tasks[l].cj->hydro.count
: 0,
(e.sched.tasks[l].ci != NULL)
? e.sched.tasks[l].ci->grav.count
: 0,
(e.sched.tasks[l].cj != NULL)
? e.sched.tasks[l].cj->grav.count
: 0,
e.sched.tasks[l].flags, e.sched.tasks[l].sid);
}
fflush(stdout);
count++;
}
fclose(file_thread);
}
/* And we wait for all to synchronize. */
MPI_Barrier(MPI_COMM_WORLD);
}
#ifdef SWIFT_DEBUG_TASKS
task_dump_all(&e, j + 1);
#endif
#else
char dumpfile[32];
snprintf(dumpfile, sizeof(dumpfile), "thread_info-step%d.dat", j + 1);
FILE *file_thread;
file_thread = fopen(dumpfile, "w");
/* Add some information to help with the plots */
fprintf(file_thread, " %d %d %d %d %lld %lld %lld %lld %lld %d %lld\n",
-2, -1, -1, 1, e.tic_step, e.toc_step, e.updates, e.g_updates,
e.s_updates, 0, cpufreq);
for (int l = 0; l < e.sched.nr_tasks; l++) {
if (!e.sched.tasks[l].implicit && e.sched.tasks[l].toc != 0) {
fprintf(
file_thread, " %i %i %i %i %lli %lli %i %i %i %i %i\n",
e.sched.tasks[l].rid, e.sched.tasks[l].type,
e.sched.tasks[l].subtype, (e.sched.tasks[l].cj == NULL),
e.sched.tasks[l].tic, e.sched.tasks[l].toc,
(e.sched.tasks[l].ci == NULL) ? 0
: e.sched.tasks[l].ci->hydro.count,
(e.sched.tasks[l].cj == NULL) ? 0
: e.sched.tasks[l].cj->hydro.count,
(e.sched.tasks[l].ci == NULL) ? 0
: e.sched.tasks[l].ci->grav.count,
(e.sched.tasks[l].cj == NULL) ? 0
: e.sched.tasks[l].cj->grav.count,
e.sched.tasks[l].sid);
}
}
fclose(file_thread);
#endif // WITH_MPI
/* Generate the task statistics. */
char dumpfile[40];
snprintf(dumpfile, 40, "thread_stats-step%d.dat", j + 1);
task_dump_stats(dumpfile, &e, /* header = */ 0, /* allranks = */ 1);
}
#endif // SWIFT_DEBUG_TASKS
#ifdef SWIFT_DEBUG_THREADPOOL
/* Dump the task data using the given frequency. */
......
......@@ -145,23 +145,24 @@ Restarts:
# Parameters governing domain decomposition
DomainDecomposition:
initial_type: memory # (Optional) The initial decomposition strategy: "grid",
# "region", "memory", or "vectorized".
initial_grid: [10,10,10] # (Optional) Grid sizes if the "grid" strategy is chosen.
initial_type: memory # (Optional) The initial decomposition strategy: "grid",
# "region", "memory", or "vectorized".
initial_grid: [10,10,10] # (Optional) Grid sizes if the "grid" strategy is chosen.
repartition_type: costs/costs # (Optional) The re-decomposition strategy, one of:
# "none/none", "costs/costs", "none/costs", "costs/none" or "costs/time".
# These are vertex/edge weights with "costs" as task timing
# and "time" as the expected time of the next updates
trigger: 0.05 # (Optional) Fractional (<1) CPU time difference between MPI ranks required to trigger a
# new decomposition, or number of steps (>1) between decompositions
minfrac: 0.9 # (Optional) Fractional of all particles that should be updated in previous step when
# using CPU time trigger
usemetis: 0 # Use serial METIS when ParMETIS is also available.
adaptive: 1 # Use adaptive repartition when ParMETIS is available, otherwise simple refinement.
itr: 100 # When adaptive defines the ratio of inter node communication time to data redistribution time, in the range 0.00001 to 10000000.0.
# Lower values give less data movement during redistributions, at the cost of global balance which may require more communication.
repartition_type: fullcosts # (Optional) The re-decomposition strategy, one of:
# "none", "fullcosts", "edgecosts", "memory" or
# "timecosts".
trigger: 0.05 # (Optional) Fractional (<1) CPU time difference between MPI ranks required to trigger a
# new decomposition, or number of steps (>1) between decompositions
minfrac: 0.9 # (Optional) Fraction of all particles that should be updated in the previous step when
# using CPU time trigger
usemetis: 0 # Use serial METIS when ParMETIS is also available.
adaptive: 1 # Use adaptive repartition when ParMETIS is available, otherwise simple refinement.
itr: 100 # When adaptive defines the ratio of inter node communication time to data redistribution time, in the range 0.00001 to 10000000.0.
# Lower values give less data movement during redistributions, at the cost of global balance which may require more communication.
use_fixed_costs: 1 # If 1 then use any compiled-in fixed costs for
# task weights in first repartition, if 0 only use task timings, if > 1 only use
# fixed costs, unless none are available.
# Structure finding options (requires velociraptor)
StructureFinding:
......
......@@ -41,8 +41,8 @@ endif
# List required headers
include_HEADERS = space.h runner.h queue.h task.h lock.h cell.h part.h const.h \
engine.h swift.h serial_io.h timers.h debug.h scheduler.h proxy.h parallel_io.h \
common_io.h single_io.h multipole.h map.h tools.h partition.h clocks.h parser.h \
physical_constants.h physical_constants_cgs.h potential.h version.h \
common_io.h single_io.h multipole.h map.h tools.h partition.h partition_fixed_costs.h \
clocks.h parser.h physical_constants.h physical_constants_cgs.h potential.h version.h \
hydro_properties.h riemann.h threadpool.h cooling_io.h cooling.h cooling_struct.h \
statistics.h memswap.h cache.h runner_doiact_vec.h profiler.h \
dump.h logger.h active.h timeline.h xmf.h gravity_properties.h gravity_derivatives.h \
......
......@@ -993,6 +993,12 @@ void engine_repartition(struct engine *e) {
* bug that doesn't handle this case well. */
if (e->nr_nodes == 1) return;
/* Generate the fixed costs include file. */
if (e->step > 3 && e->reparttype->trigger <= 1.f) {
task_dump_stats("partition_fixed_costs.h", e, /* header = */ 1,
/* allranks = */ 1);
}
/* Do the repartitioning. */
partition_repartition(e->reparttype, e->nodeID, e->nr_nodes, e->s,
e->sched.tasks, e->sched.nr_tasks);
......@@ -1048,32 +1054,41 @@ void engine_repartition_trigger(struct engine *e) {
const ticks tic = getticks();
/* Do nothing if there have not been enough steps since the last
* repartition, don't want to repeat this too often or immediately after
* a repartition step. Also nothing to do when requested. */
/* Do nothing if there have not been enough steps since the last repartition
* as we don't want to repeat this too often or immediately after a
* repartition step. Also nothing to do when requested. */
if (e->step - e->last_repartition >= 2 &&
e->reparttype->type != REPART_NONE) {
/* Old style if trigger is >1 or this is the second step (want an early
* repartition following the initial repartition). */
if (e->reparttype->trigger > 1 || e->step == 2) {
/* If we have fixed costs available and this is step 2 or we are forcing
* repartitioning then we do a fixed costs one now. */
if (e->reparttype->trigger > 1 ||
(e->step == 2 && e->reparttype->use_fixed_costs)) {
if (e->reparttype->trigger > 1) {
if ((e->step % (int)e->reparttype->trigger) == 0) e->forcerepart = 1;
} else {
e->forcerepart = 1;
}
e->reparttype->use_ticks = 0;
} else {
/* Use cputimes from ranks to estimate the imbalance. */
/* First check if we are going to skip this stage anyway, if so do that
* now. If is only worth checking the CPU loads when we have processed a
* significant number of all particles. */
/* It is only worth checking the CPU loads when we have processed a
* significant number of all particles as we require all tasks to have
* timings. */
if ((e->updates > 1 &&
e->updates >= e->total_nr_parts * e->reparttype->minfrac) ||
(e->g_updates > 1 &&
e->g_updates >= e->total_nr_gparts * e->reparttype->minfrac)) {
/* Should we use the task timings or fixed costs? */
if (e->reparttype->use_fixed_costs > 1) {
e->reparttype->use_ticks = 0;
} else {
e->reparttype->use_ticks = 1;
}
/* Get CPU time used since the last call to this function. */
double elapsed_cputime =
clocks_get_cputime_used() - e->cputime_last_step;
......@@ -1096,17 +1111,22 @@ void engine_repartition_trigger(struct engine *e) {
double mean = sum / (double)e->nr_nodes;
/* Are we out of balance? */
if (((maxtime - mintime) / mean) > e->reparttype->trigger) {
double abs_trigger = fabs(e->reparttype->trigger);
if (((maxtime - mintime) / mean) > abs_trigger) {
if (e->verbose)
message("trigger fraction %.3f exceeds %.3f will repartition",
(maxtime - mintime) / mintime, e->reparttype->trigger);
message("trigger fraction %.3f > %.3f will repartition",