A rebuild cannot be triggered (and thus neighbours are lost) if a cell has no pair tasks.
So this only happens if there's a void between two highly dense areas or something similar (so it's more likely in engineering SPH, or may never happen in a realistic test case; I'm not clear).
My first naive solution to this problem (which works but is mega slow) was to just run:
__attribute__((always_inline)) INLINE static int
cell_need_rebuild_for_hydro_self(const struct cell *c) {
  int res = 0;
  for (int i = 0; i < c->hydro.count; i++) {
    struct part *p = &c->hydro.parts[i];
    if (p->x[0] <= c->loc[0] || p->x[0] >= c->loc[0] + c->width[0] ||
        p->x[1] <= c->loc[1] || p->x[1] >= c->loc[1] + c->width[1] ||
        p->x[2] <= c->loc[2] || p->x[2] >= c->loc[2] + c->width[2]) {
      res = 1;
      break;
    }
  }
  return res;
}
on every cell, regardless of what it does. I did initially try checking whether the maximum movement of a particle was <= gamma*h_max, but that didn't work (though possibly because of the issue discussed below).
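For reference, a minimal sketch of the displacement check I mean, written as a standalone toy rather than against the real SWIFT structs. `toy_xpart` and `toy_moved_too_far` are hypothetical names, and it assumes `x_diff` holds the offset accumulated since the last rebuild:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-particle displacement
 * record; NOT the real SWIFT xpart. */
struct toy_xpart {
  float x_diff[3]; /* offset accumulated since the last rebuild (assumed) */
};

/* Returns 1 if any particle has moved further than gamma * h_max since
 * the last rebuild, 0 otherwise. Compares squared lengths to avoid a sqrt. */
static int toy_moved_too_far(const struct toy_xpart *xparts, size_t count,
                             float gamma, float h_max) {
  const float limit = gamma * h_max;
  const float limit2 = limit * limit;
  for (size_t i = 0; i < count; i++) {
    const float *d = xparts[i].x_diff;
    const float dx2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
    if (dx2 > limit2) return 1;
  }
  return 0;
}
```

With gamma = 2.5 and h_max = 0.002 (so 2.5*h = 0.005m, as in my test case), a particle that has moved 0.002m does not trigger a rebuild and one that has moved 0.01m does. Whether this criterion is actually sufficient is exactly what I still need to show.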
My improved idea is to only run that function on cells that have no hydro pairs. However, with that fix and a high number of cells (140^3 maximum; the box is a 1.5m cube with 2.5*h = 0.005m, so in principle I could support 300^3), my current test never rebuilds. Based on the movement of the particles that seems unlikely, though I'm still early in the run and maybe it does later.
I guess what I need to be able to show is that this criterion is sufficient to always find all neighbours regardless of particle/cell structure, and then check that I've implemented all the required data correctly.
My concern is that by disabling the drift, and setting
xp->x_diff[k] -= dx;
xp->x_diff_sort[k] -= dx;
only in kick.h, which only happens in kick2, these quantities might never be updated in the cell. In fact I'm almost certain that they won't be: without delving too deeply into the code, it appears to me that c->hydro.h_max
and friends are only updated in drift_part, and I'm not sure when xp->x_diff
and co. are reset (if ever; grep wasn't my friend). Even at best the update will be delayed by 1 step (if my memory serves correctly and kick2 happens later in the step than the drift).
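To make the bookkeeping I'm worried about concrete, here is a rough standalone sketch of what I think has to happen if the displacement is accumulated in the kick instead of the drift. Everything prefixed `toy_` is hypothetical and only mirrors the field names above; it is not how SWIFT itself does it, and it assumes a single thread updates a given cell at a time:

```c
/* Hypothetical stand-ins; NOT the real SWIFT structs. */
struct toy_xpart {
  float x_diff[3];      /* offset since the last rebuild */
  float x_diff_sort[3]; /* offset since the last sort */
};

struct toy_cell {
  float dx2_max_part; /* largest squared offset of any particle in the cell */
};

/* Accumulate the position change dx applied in the kick and keep the
 * cell-level maximum up to date in the same place. */
static void toy_kick_accumulate(struct toy_cell *c, struct toy_xpart *xp,
                                const float dx[3]) {
  float dx2 = 0.f;
  for (int k = 0; k < 3; k++) {
    xp->x_diff[k] -= dx[k];
    xp->x_diff_sort[k] -= dx[k];
    dx2 += xp->x_diff[k] * xp->x_diff[k];
  }
  if (dx2 > c->dx2_max_part) c->dx2_max_part = dx2;
}

/* After a rebuild the accumulated offsets have to start again from zero,
 * otherwise the criterion would fire on every subsequent step. */
static void toy_reset_after_rebuild(struct toy_cell *c,
                                    struct toy_xpart *xparts, int count) {
  for (int i = 0; i < count; i++) {
    for (int k = 0; k < 3; k++) {
      xparts[i].x_diff[k] = 0.f;
      xparts[i].x_diff_sort[k] = 0.f;
    }
  }
  c->dx2_max_part = 0.f;
}
```

The point of the sketch is just that the per-particle accumulation, the per-cell maximum and the reset all need to exist somewhere once the drift is disabled; where they belong in the real task graph is the part I'm unsure about.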
I'll have a think, when less ill/busy, about trying to prove the former is sufficient, but in the meantime, @matthieu, is there a good way I can try to solve the latter issue that I think may be happening?
Edit: The code did rebuild after 8218 steps. For that number of cells the rebuild took 28.3s (2-node sim), and not rebuilding saves 0.09s per step compared with rebuilding every step (which can happen with fewer cells), which based on my maths saves a lot of seconds for this section of the simulation. However, from this point on the rebuilds are occurring every 3000 steps. On a single node the first rebuild occurs at a different step count (8717, and at a different simulation time), which I don't think should happen, but there the rebuilds take 1s and save 0.09s per step. For this simulation (200k parts), 2 nodes with 32 shared-memory threads per node save no time on non-rebuild steps, and the rebuilds are 28x slower (due to data movement, I assume). I think the latter issue is still a problem, as both runs should do the same physics and thus rebuild at, at the very least, a similar simulation time. One thing I'm unclear on is whether I Allreduce the timestep between the nodes or not. I think I decided I needed to, so that shouldn't be an issue, but I'll double check.
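The maths behind the saving is just the following, assuming the 0.09s saving applies uniformly to every step between two rebuilds (the function names are only for illustration):

```c
/* Net wall-clock time saved (seconds) over one rebuild interval: every
 * step in the interval saves saving_per_step_s compared with rebuilding
 * each step, and one rebuild of cost rebuild_cost_s is paid at the end. */
static double net_saving_s(long steps, double saving_per_step_s,
                           double rebuild_cost_s) {
  return (double)steps * saving_per_step_s - rebuild_cost_s;
}

/* Number of steps between rebuilds at which the saving exactly pays for
 * one rebuild. */
static double break_even_steps(double rebuild_cost_s,
                               double saving_per_step_s) {
  return rebuild_cost_s / saving_per_step_s;
}
```

With the 2-node numbers, net_saving_s(8218, 0.09, 28.3) is about 711s for the first interval and net_saving_s(3000, 0.09, 28.3) is about 242s for each later one. The break-even point is roughly 314 steps, so a rebuild every 3000 steps is still comfortably worth it even at 28.3s per rebuild.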