MPI fixes

Pedro Gonnet requested to merge mpi_fixes into master

Found and fixed the bug; I can now run the following locally on cosma-a:

[nnrw56@cosma-a examples]$ mpirun -np 4 ./test_fixdt_mpi -r 100 -t 2 -g "2 2 1" -f CosmoVolume/cosmoVolume.hdf5 -m 0.705 -w 6000 -z 300 -d 1e-8
[000] main: MPI is up and running with 4 nodes.
[000] main: grid set to [ 2 2 1 ].
[000] main: maximum h set to 7.050000e-01.
[000] main: sub size set to 6000.
[000] main: split size set to 300.
[000] main: dt set to 1.000000e-08.
[000] main: sizeof(struct part) is 128 bytes.
[000] main: sizeof(struct gpart) is 64 bytes.
[000] main: Unit system: U_M = 1.000000e+00 g.
[000] main: Unit system: U_L = 1.000000e+00 cm.
[000] main: Unit system: U_t = 0.000000e+00 s.
[000] main: Unit system: U_I = 1.000000e+00 A.
[000] main: Unit system: U_T = 1.000000e+00 K.
[000] main: Density units: 1.000000e+00 a^-3.000000 h^2.000000.
[000] main: Entropy units: inf a^4.000000 h^-1.333333.
[000] main: reading particle properties took 1245.615 ms.
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: h_max is 4.081e-01 (cell_max=7.050e-01).
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: space_init took 40.547 ms.
[000] main: dt_max is 1.000000e-08.
[000] main: space dimensions are [ 6.250 6.250 6.250 ].
[000] main: space is periodic.
[000] main: highest-level cell dimensions are [ 6 6 6 ].
[000] main: 460281 parts in 216 cells.
[000] main: maximum depth is 0.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[000] main: map_cellcheck picked up 0 parts.
[000] main: nr of cells at depth 0 is 216.
[000] main: nr_nodes is 4.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[000] space_regrid: set cell dimensions to [ 6 6 6 ].
[000] main: nr_nodes is 4.
[002] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 [001] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 55 57 59 61 63 ].
[000] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 43 45 47 49 51 53 55 57 59 61 63 [003] engine_init: cpu map is [ 0 32 16 48 8 24 40 56 4 12 20 28 36 44 52 60 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 ].
13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 ].
].
[000] main: engine_init took 4.417 ms.
[001] engine_redistribute: node 1 now has 381147 parts in 48 cells.
[003] engine_redistribute: node 3 now has 139728 parts in 24 cells.
[000] engine_redistribute: node 0 now has 804888 parts in 96 cells.
[002] engine_redistribute: node 2 now has 515364 parts in 48 cells.
[003] main: writing particle properties took 1152.966 ms.
[003] main: starting for 100 steps with 2 threads and 2 queues...
[000] main: writing particle properties took 1134.348 ms.
[000] main: starting for 100 steps with 2 threads and 2 queues...
# step time e_tot e_kin e_temp dt dt_step count dt_min dt_max
[002] main: writing particle properties took 1134.243 ms.
[002] main: starting for 100 steps with 2 threads and 2 queues...
[001] main: writing particle properties took 1154.141 ms.
[001] main: starting for 100 steps with 2 threads and 2 queues...
[000] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[001] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[002] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[003] space_regrid: h_max is 3.205e-01 (cell_max=0.000e+00).
[003] engine_rebuild: task counts are [ none=0 sort=2080 self=3492 pair=14880 sub=0 ghost=24 kick1=0 kick2=24 send=144 recv=144 link=646 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[003] engine_rebuild: nr_parts = 139728.
[001] engine_rebuild: task counts are [ none=0 sort=5228 self=8888 pair=38558 sub=0 ghost=48 kick1=0 kick2=48 send=192 recv=204 link=1785 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[001] engine_rebuild: nr_parts = 381147.
[002] engine_rebuild: task counts are [ none=0 sort=7461 self=12822 pair=53006 sub=0 ghost=48 kick1=0 kick2=48 send=192 recv=192 link=2277 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[002] engine_rebuild: nr_parts = 515364.
[000] engine_rebuild: task counts are [ none=0 sort=11195 self=19260 pair=82086 sub=0 ghost=96 kick1=0 kick2=96 send=252 recv=240 link=3755 grav_pp=0 grav_mm=0 grav_up=0 grav_down=0 skipped=0 ]
[000] engine_rebuild: nr_parts = 804888.
0 1.000000e-08 5.1730708358525280e+06 2.2517348558944156e+06 2.9213359799581119e+06 1.000e-08 3.403e+38 0 1.201e-08 1.686e-03 0.000 334.257 1.779 55.402 1706.244 1974.563 1060.687 0.000 65049.109 55063.251 0.000 0.000 0.000 0.000 19044.610 19205.921 533.745 512.945 5.383 72369.969 72706.348
1 2.000000e-08 5.1730711787431166e+06 2.2517240292652557e+06 2.9213471494778614e+06 1.000e-08 3.403e+38 0 1.199e-08 1.686e-03 0.000 5.281 0.000 58.884 0.000 848.888 1054.040 0.000 21747.300 23349.203 0.000 0.000 0.000 0.000 0.000 39.853 3009.585 1481.495 1138.413 25084.257 30963.648
2 3.000000e-08 5.1730711972561674e+06 2.2517138635396268e+06 2.9213573337165406e+06 1.000e-08 3.403e+38 0 1.351e-08 1.686e-03 0.000 1.795 0.000 60.258 0.000 842.590 1057.486 0.000 21162.531 23289.499 0.000 0.000 0.000 0.000 0.000 39.831 1632.495 1394.155 160.481 24067.260 30685.746
3 4.000000e-08 5.1730711973641682e+06 2.2517043660566136e+06 2.9213668313075551e+06 1.000e-08 3.403e+38 0 1.451e-08 1.686e-03 0.000 1.786 0.000 57.933 0.000 839.370 1057.145 0.000 21350.334 23216.613 0.000 0.000 0.000 0.000 0.000 39.686 5506.825 2691.968 2756.773 26057.512 31147.877
4 5.000000e-08 5.1730711878426019e+06 2.2516955296040117e+06 2.9213756582385898e+06 1.000e-08 3.403e+38 0 1.551e-08 1.686e-03 0.000 1.705 0.000 59.648 0.000 832.407 1058.578 0.000 21301.543 23424.617 0.000 0.000 0.000 0.000 0.000 39.826 2924.990 1717.112 1175.071 24843.851 30848.812

All nodes now agree on the top-level grid, which is nice. Nodes disagreeing on that grid was the source of the bug in #16 (closed).
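For illustration only, a check that every node ended up with the same top-level cell dimensions could look roughly like the sketch below. This is not the code in this branch: `grids_agree` is a hypothetical helper, and it assumes the per-node dimensions have already been collected into one flat buffer of 3 ints per node (for example with `MPI_Allgather`), so the comparison itself needs no MPI calls.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the code base.
 *
 * cdims    : flat array of 3 * nr_nodes ints; entries 3*n .. 3*n+2 hold
 *            the top-level cell dimensions reported by node n (the layout
 *            an MPI_Allgather of 3 ints per rank would produce).
 * nr_nodes : number of nodes that contributed to cdims.
 *
 * Returns 1 if every node reports the same dimensions as node 0,
 * 0 otherwise (including for invalid input). */
int grids_agree(const int *cdims, int nr_nodes) {
  if (cdims == NULL || nr_nodes < 1) return 0;

  /* Compare each node's triple against node 0's triple. */
  for (int n = 1; n < nr_nodes; n++)
    for (int k = 0; k < 3; k++)
      if (cdims[3 * n + k] != cdims[k]) return 0;

  return 1;
}
```

With four nodes all reporting `[ 6 6 6 ]`, as in the log above, this returns 1; if any single entry differs it returns 0, which would be the point to abort before tasks are built.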

Still haven't submitted any real jobs with it, but it's getting late and I really have to get to bed.