EAGLE-XL run aborting with Interacting unsorted cells error.
When running examples/EAGLE_ICs/EAGLE_50 using 12 MPI ranks on 12 nodes of cosma7-rp, we consistently get the error:
runner_doiact_functions_hydro.h:runner_dopair_subset_branch_density():903: Interacting unsorted cells.
reported after 300-2000 steps (which takes 6-12 hours). This is seen every time with OpenMPI/GCC, but has been seen only once with Intel MPI 2018/GCC. The pure Intel toolchains never show this.
Capturing this once in a debugger shows that it is happening for ghost tasks that have failed to converge and so are attempting to migrate up the cell tree to get more particles (odd, as this is at high redshift). What happens is that, seen from the parent cell, at least one of the siblings is not sorted; this is clear from the flags. Could it be that not all child cells of a supercell are sorted, and that this walk back up the tree assumes they are?
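To make the suspected condition concrete, here is a minimal sketch of the check implied above: walk from a cell towards the root and report the first ancestor that has a non-empty child without the sort flag for a given direction. This is a toy model only. `struct toy_cell`, its fields, the `toy_*` function names and the bound of 13 sort directions are all assumptions made for illustration, not SWIFT's actual `struct cell` or API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a tree cell. Hypothetical: NOT SWIFT's struct cell. */
struct toy_cell {
  struct toy_cell *parent;     /* NULL at the top of the tree */
  struct toy_cell *progeny[8]; /* NULL where there is no child */
  unsigned int sorted;         /* bit sid set => sorted along direction sid */
  int count;                   /* number of particles in this cell */
};

/* Count the non-empty children of c that lack the sort flag for sid.
 * The limit of 13 directions is an assumption of this sketch. */
int toy_count_unsorted_progeny(const struct toy_cell *c, int sid) {
  assert(sid >= 0 && sid < 13);
  int n = 0;
  for (int k = 0; k < 8; k++) {
    const struct toy_cell *p = c->progeny[k];
    if (p != NULL && p->count > 0 && !(p->sorted & (1u << sid))) n++;
  }
  return n;
}

/* Walk from c towards the root and return the first ancestor that has
 * at least one unsorted, non-empty child for sid, or NULL if none does. */
const struct toy_cell *toy_first_ancestor_with_unsorted_progeny(
    const struct toy_cell *c, int sid) {
  for (const struct toy_cell *f = c->parent; f != NULL; f = f->parent) {
    if (toy_count_unsorted_progeny(f, sid) > 0) return f;
  }
  return NULL;
}
```

A check of this shape, run at the point where the ghost task moves up to the parent, would report which sibling is missing its sort flag before the pair interaction aborts.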
Stack trace:
1-7,9,11 232 pthread_cond_wait@@GLIBC_2.3.2
3,7 2 runner_main (runner_main.c:407) runner_do_ghost(r, ci, 1);
3,7 2 runner_do_ghost (runner_ghost.c:1102)
3,7 2 runner_do_ghost (runner_ghost.c:1494)
3 1 runner_dosub_subset_density (runner_doiact_functions_hydro.h:2710)
3 1 runner_dosub_subset_density (runner_doiact_functions_hydro.h:2738)
3 1 runner_dosub_subset_density (runner_doiact_functions_hydro.h:2749)
3 1 runner_dopair_subset_branch_density (runner_doiact_functions_hydro.h:903)
error("Interacting unsorted cells.");