Commit 0a65a95b authored by Matthieu Schaller

Merge branch 'mpi_fixes' into 'master'

MPI fixes

Found and fixed the bug. The following now runs locally on `cosma-a`:

```
[nnrw56@cosma-a examples]$ mpirun -np 4 ./test_fixdt_mpi -r 100 -t 2 -g "2 2 1" -f CosmoVolume/cosmoVolume.hdf5 -m 0.705 -w 6000 -z 300 -d 1e-8
[000] main: MPI is up and running with 4 nodes.
[000] main: grid set to [ 2 2 1 ].
[000] main: maximum h set to 7.050000e-01.
[000] main: sub size set to 6000.
[000] main: split size set to 300.
[000] main: dt set to 1.000000e-08.
[000] main: sizeof(struct part) is 128 bytes.
[000] main: sizeof(struct gpart) is 64 bytes.
[000] main: Unit system: U_M = 1.000000e+00 g.
[000] main: Unit system: U_L = 1.000000e+00 cm.
[000] main: Unit system: U_t = 0.000000e+00 s.
[000] main: Unit system: U_I = 1.000000e+00 A.
[000] main: Unit system: U_T = 1.000000e+00 K.
[000] main: Density units: 1.000000e+00 a^-3.000000 h^2.000000.
[000] main: Entropy units: inf a^4.000000 h^-1.333333.
[000] main: reading particle properties took 1245.615 ms.
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: space_init took 40.547 ms.
[000] main: dt_max is 1.000000e-08.
[000] main: space dimensions are [ 6.250 6.250 6.250 ].
[000] main: space is periodic.
[000] main: highest-level cell dimensions are [ 6 6 6 ].
[000] main: 460281 parts in 216 cells.
[000] main: maximum depth is 0.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[000] main: map_cellcheck picked up 0 parts.
[000] main: nr of cells at depth 0 is 216.
[000] main: nr_nodes is 4.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[002] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 [001] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 55 57 59 61 63 ].
[000] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 43 45 47 49 51 53 55 57 59 61 63 [003] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 ].
13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 ].
].
[000] main: engine_init took 4.417 ms.
[001] engine_redistribute: node 1 now has 381147 parts in 48 cells.
[003] engine_redistribute: node 3 now has 139728 parts in 24 cells.
[000] engine_redistribute: node 0 now has 804888 parts in 96 cells.
[002] engine_redistribute: node 2 now has 515364 parts in 48 cells.
[003] main: writing particle properties took 1152.966 ms.
[003] main: starting for 100 steps with 2 threads and 2 queues...
[000] main: writing particle properties took 1134.348 ms.
[000] main: starting for 100 steps with 2 threads and 2 queues...
# step time e_tot e_kin e_temp dt dt_step count dt_min dt_max
[002] main: writing particle properties took 1134.243 ms.
[002] main: starting for 100 steps with 2 threads and 2 queues...
[001] main: writing particle properties took 1154.141 ms.
[001] main: starting for 100 steps with 2 threads and 2 queues...
[000] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[001] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[002] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[003] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[003] engine_rebuild: task counts are [ none=0 sort=2080 self=3492 pair=14880 sub=0 ghost=24 kick1=0 kick2=24 send=144 recv=144 link=646 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[003] engine_rebuild: nr_parts = 139728.
[001] engine_rebuild: task counts are [ none=0 sort=5228 self=8888 pair=38558 sub=0 ghost=48 kick1=0 kick2=48 send=192 recv=204 link=1785 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[001] engine_rebuild: nr_parts = 381147.
[002] engine_rebuild: task counts are [ none=0 sort=7461 self=12822 pair=53006 sub=0 ghost=48 kick1=0 kick2=48 send=192 recv=192 link=2277 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[002] engine_rebuild: nr_parts = 515364.
[000] engine_rebuild: task counts are [ none=0 sort=11195 self=19260 pair=82086 sub=0 ghost=96 kick1=0 kick2=96 send=252 recv=240 link=3755 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[000] engine_rebuild: nr_parts = 804888.
0 1.000000e-08 5.1730708358525280e+06 2.2517348558944156e+06 2.9213359799581119e+06 1.000e-08 3.403e+38 0 1.201e-08 1.686e-03 0.000 334.257 1.779 55.402 1706.244 1974.563 1060.687 0.000 65049.109 55063.251 0.000 0.000 0.000 0.000 19044.610 19205.921 533.745 512.945 5.383 72369.969 72706.348
1 2.000000e-08 5.1730711787431166e+06 2.2517240292652557e+06 2.9213471494778614e+06 1.000e-08 3.403e+38 0 1.199e-08 1.686e-03 0.000 5.281 0.000 58.884 0.000 848.888 1054.040 0.000 21747.300 23349.203 0.000 0.000 0.000 0.000 0.000 39.853 3009.585 1481.495 1138.413 25084.257 30963.648
2 3.000000e-08 5.1730711972561674e+06 2.2517138635396268e+06 2.9213573337165406e+06 1.000e-08 3.403e+38 0 1.351e-08 1.686e-03 0.000 1.795 0.000 60.258 0.000 842.590 1057.486 0.000 21162.531 23289.499 0.000 0.000 0.000 0.000 0.000 39.831 1632.495 1394.155 160.481 24067.260 30685.746
3 4.000000e-08 5.1730711973641682e+06 2.2517043660566136e+06 2.9213668313075551e+06 1.000e-08 3.403e+38 0 1.451e-08 1.686e-03 0.000 1.786 0.000 57.933 0.000 839.370 1057.145 0.000 21350.334 23216.613 0.000 0.000 0.000 0.000 0.000 39.686 5506.825 2691.968 2756.773 26057.512 31147.877
4 5.000000e-08 5.1730711878426019e+06 2.2516955296040117e+06 2.9213756582385898e+06 1.000e-08 3.403e+38 0 1.551e-08 1.686e-03 0.000 1.705 0.000 59.648 0.000 832.407 1058.578 0.000 21301.543 23424.617 0.000 0.000 0.000 0.000 0.000 39.826 2924.990 1717.112 1175.071 24843.851 30848.812
```

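All nodes now agree on the top-level grid, because `h_max` is reduced to its global maximum across ranks before the cell dimensions are computed. The earlier disagreement was the source of the bug in #16.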
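To illustrate why the reduction matters, here is a minimal sketch in plain C with no MPI dependency. `allreduce_max` is a hypothetical stand-in for `MPI_Allreduce` with `MPI_MAX`, and `cells_per_axis` is a simplified assumption about how the grid follows from `h_max`; the real `space_regrid` may apply additional factors. The per-rank `h_max` values in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of top-level cells along one axis.
   Simplified assumption: a cell must be at least as wide as the larger
   of h_max and cell_max. */
int cells_per_axis(double box_size, float h_max, double cell_max) {
  double width = ((double)h_max > cell_max) ? (double)h_max : cell_max;
  assert(width > 0.0);
  /* Truncation equals floor here because both operands are positive. */
  int n = (int)(box_size / width);
  return n > 0 ? n : 1;
}

/* Stand-in for an all-reduce with MPI_MAX: after the call, every
   "rank" (array slot) holds the global maximum. */
void allreduce_max(float *values, size_t n) {
  assert(n > 0);
  float global = values[0];
  for (size_t i = 1; i < n; i++)
    if (values[i] > global) global = values[i];
  for (size_t i = 0; i < n; i++)
    values[i] = global;
}
```

With four ranks holding local values `{0.9f, 1.3f, 0.8f, 1.1f}`, a box size of 6.25 and `cell_max` of 0.705, ranks 0 and 1 would pick different grids (6 and 4 cells per axis) without the reduction. After `allreduce_max`, every rank computes the same 4 cells per axis.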

Still haven't submitted any real jobs with it, but it's getting late and I really have to get to bed.

See merge request !7


Former-commit-id: 0eb1f7b3cf45940913213244ef745b1c50be74f3
parents e1b324fd ec6cf7c8
......@@ -185,8 +185,18 @@ void space_regrid ( struct space *s , double cell_max ) {
}
s->h_max = h_max;
}
// message( "h_max is %.3e (cell_max=%.3e)." , h_max , cell_max );
// message( "getting h_min and h_max took %.3f ms." , (double)(getticks() - tic) / CPU_TPS * 1000 );
/* If we are running in parallel, make sure everybody agrees on
how large the largest cell should be. */
#ifdef WITH_MPI
{
float buff;
if ( MPI_Allreduce( &h_max , &buff , 1 , MPI_FLOAT , MPI_MAX , MPI_COMM_WORLD ) != MPI_SUCCESS )
error( "Failed to aggregate h_max across nodes." );
h_max = buff;
}
#endif
message( "h_max is %.3e (cell_max=%.3e)." , h_max , cell_max );
/* Get the new putative cell dimensions. */
for ( k = 0 ; k < 3 ; k++ )
......@@ -497,7 +507,7 @@ void parts_sort ( struct part *parts , struct xpart *xparts , int *ind , int N ,
first = 0; last = 1; waiting = 1;
/* Parallel bit. */
-#pragma omp parallel default(none) shared(N,first,last,waiting,qstack,parts,xparts,ind,qstack_size,stderr,engine_rank) private(pivot,i,ii,j,jj,min,max,temp_i,qid,temp_xp,temp_p)
+#pragma omp parallel default(shared) shared(N,first,last,waiting,qstack,parts,xparts,ind,qstack_size,stderr,engine_rank) private(pivot,i,ii,j,jj,min,max,temp_i,qid,temp_xp,temp_p)
{
/* Main loop. */
......@@ -657,7 +667,7 @@ void gparts_sort ( struct gpart *gparts , int *ind , int N , int min , int max )
first = 0; last = 1; waiting = 1;
/* Parallel bit. */
-#pragma omp parallel default(none) shared(N,first,last,waiting,qstack,gparts,ind,qstack_size,stderr,engine_rank) private(pivot,i,ii,j,jj,min,max,temp_i,qid,temp_p)
+#pragma omp parallel default(shared) shared(N,first,last,waiting,qstack,gparts,ind,qstack_size,stderr,engine_rank) private(pivot,i,ii,j,jj,min,max,temp_i,qid,temp_p)
{
/* Main loop. */
......