Neighbour Counting Inconsistencies in hydro-hydro interaction tasks for RT gradient/transport tasks
I'm encountering issues with !1213 (merged), getting towards my wit's end, and hoping for some help here.
The idea of the merge request is to implement two new task groups for the radiative transfer that are to cover the computation of the gradient and the transport step of the RT method.
The gradient step corresponds to the extra hydro loop and does essentially the same thing, but with non-hydro quantities. So I use the non-symmetric self1/pair1 type of runner functions.
The transport step corresponds to the force step in sph/meshless. Hence I use the symmetric self2/pair2 type of runner functions.
I use the densities and smoothing lengths as determined by the hydro part of the time step, and they are not changed/touched during the RT part of a time step.
A check that I am performing is to count all interactions of each particle during the gradient interaction and during the transport interaction individually. My expectation is that the number of particle-particle interactions during the symmetric transport interaction must be >= the number of non-symmetric interactions for each particle. But this isn't the case for all particles.
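To make the expectation concrete, here is a minimal brute-force sketch. It assumes (this is not taken from the SWIFT source) that the non-symmetric loop counts j as a neighbour of i when r < H_i, and that the symmetric loop counts the pair for both particles when r < max(H_i, H_j). The function name, particle positions and radii are all hypothetical.

```python
import random

def neighbour_sets(pos, H):
    """Brute-force neighbour sets for every particle.

    Assumed criteria (not taken from the SWIFT source):
      non-symmetric: j counts for i when r < H[i]
      symmetric:     i and j count for each other when r < max(H[i], H[j])
    Returns two lists of sets of particle indices: (nonsym, sym).
    """
    n = len(pos)
    nonsym = [set() for _ in range(n)]
    sym = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j])) ** 0.5
            if r < H[i]:
                nonsym[i].add(j)
            if r < H[j]:
                nonsym[j].add(i)
            if r < max(H[i], H[j]):
                sym[i].add(j)
                sym[j].add(i)
    return nonsym, sym

# Hypothetical test particles in a unit box with random support radii.
rng = random.Random(42)
pos = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
H = [rng.uniform(0.05, 0.25) for _ in range(200)]

nonsym, sym = neighbour_sets(pos, H)
# Particles whose non-symmetric neighbours are not all symmetric neighbours.
violations = [i for i in range(len(pos)) if not nonsym[i] <= sym[i]]
print(len(violations))  # prints 0
```

Under these assumed criteria the non-symmetric neighbour set of each particle is always a subset of its symmetric one, which is why a transport count below the gradient count looks like a bug rather than a property of the method.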
I compile my branch with
--disable-mpi --disable-doxygen-doc --enable-cell-graph --disable-hand-vec --with-rt=debug --with-hydro=gadget2 --with-stars=GEAR --with-feedback=none --enable-debug --enable-debugging-checks
With this merge request, I allow the code to run with --feedback even when compiled --with-feedback=none so the stars' smoothing lengths get computed.
I run the IsolatedGalaxy_feedback example with
../../swiftsim/examples/swift \
--external-gravity \
--self-gravity \
--hydro \
--threads=4 \
--steps=50 \
--limiter \
--sync \
--stars \
--radiation \
--cell-dumps=1 \
--feedback \
isolated_galaxy.yml 2>&1 | tee output.log
I have noticed the following:
- which particles have more gradient calls than transport calls varies from run to run
- the counts are fine (i.e. they match my expectations) when running on 1 thread only
- the counts are fine if I run without the --feedback flag.
For debugging purposes, I have been counting how many tasks get created in engine_maketasks and (re)activated in engine_marktasks and cell_unskip. I also differentiate between self, sub self, pair, and sub pair (shortened as S, SS, P, and SP in the tables below). LA stands for "link added" in engine_maketasks using engine_addlink(), LW stands for "Link Walked" in cell_unskip, where the tasks are unskipped by looping through the linked list. In both cases, I just count how many times that happened.
Here is some example output for cells with non-zero RT gradient/transport tasks:
CellID || gradient || transport ||
|| created | cell_unskip | engine_marktask | || created | cell_unskip | engine_marktask | ||
-------------------------------------------------------------------------------------------------------------------------------------------------------
|| S SS P SP | S SS P SP | S SS P SP | LA LW || S SS P SP | S SS P SP | S SS P SP | LA LW ||
=======================================================================================================================================================
-1100 || 0 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 1 || 0 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 1 ||
-1087 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-1086 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-1075 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-1074 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-1062 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
-955 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-954 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-944 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-943 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 ||
-942 || 0 0 0 0 | 0 1 9 12 | 0 0 0 0 | 0 22 || 0 0 0 0 | 0 1 9 12 | 0 0 0 0 | 0 22 ||
-932 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-931 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 ||
-930 || 0 0 0 0 | 0 1 9 12 | 0 0 0 0 | 0 22 || 0 0 0 0 | 0 1 9 12 | 0 0 0 0 | 0 22 ||
-918 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-810 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-800 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-799 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 ||
-798 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 ||
-797 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-787 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 ||
-786 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 || 0 0 0 0 | 0 1 10 12 | 0 0 0 0 | 0 23 ||
-785 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-775 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-655 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-654 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-653 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
-644 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
-643 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-642 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 || 0 0 0 0 | 0 0 4 0 | 0 0 0 0 | 0 4 ||
-630 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3223551 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3225307 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3271533 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3273289 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3812790 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3814546 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3860772 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
3862528 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 || 0 0 0 0 | 0 0 2 0 | 0 0 0 0 | 0 2 ||
In the script, I check whether there are differences between the gradient and transport tasks. There are none so far, so this part seems to be fine.
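A check of this kind can be sketched as follows. This is not the actual script: parse_row, links_walked_consistent and mismatched_cells are hypothetical names, and the parsing assumes the row layout of the table above (cell ID, then 14 gradient counters, then 14 transport counters, with || between the groups).

```python
def parse_row(line):
    """Split one table row into (cell_id, gradient_counts, transport_counts).

    Each counts list has 14 integers in the column order of the table:
    created S SS P SP, cell_unskip S SS P SP, engine_marktask S SS P SP, LA, LW.
    """
    parts = [p.strip() for p in line.split("||")]
    cell_id = int(parts[0])
    gradient = [int(x) for x in parts[1].replace("|", " ").split()]
    transport = [int(x) for x in parts[2].replace("|", " ").split()]
    return cell_id, gradient, transport

def links_walked_consistent(counts):
    """True if LW equals the sum of the four cell_unskip counters.

    This holds for the rows shown above; whether it must always hold
    is an assumption of this sketch.
    """
    return counts[13] == sum(counts[4:8])

def mismatched_cells(lines):
    """Return the IDs of cells whose gradient and transport counters differ."""
    bad = []
    for line in lines:
        cell_id, gradient, transport = parse_row(line)
        if gradient != transport:
            bad.append(cell_id)
    return bad

# Two rows copied from the table above.
rows = [
    "-1100 || 0 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 1 || 0 0 0 0 | 0 0 1 0 | 0 0 0 0 | 0 1 ||",
    "-943 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 || 0 0 0 0 | 0 1 11 12 | 0 0 0 0 | 0 24 ||",
]
print(mismatched_cells(rows))  # prints []
```

For the two rows used here the gradient and transport halves are identical, so no cell is reported.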
Then using another script, I check whether the number of calls to every particle in the transport step is >= the number of calls in the gradient tasks. Here is some sample output for snapshot 3.
G!inT means "called in gradient, but not in transport", and T!inG means "called in transport, but not in gradient".
Checking hydro sanity pt2 3
transport calls < gradient calls
Particle ID 3235
-- cell ID 1012538540048
-- calls grad: 28 calls transport: 27
--- G!inT: ID 21785 cellID 1012538540048 | r/Hi: 0.985 r/Hj: 1.324
Particle ID 3957
-- cell ID 1012078364965
-- calls grad: 43 calls transport: 41
--- G!inT: ID 16569 cellID 1012078364965 | r/Hi: 1.002 r/Hj: 1.405
--- G!inT: ID 15122 cellID 1012078364965 | r/Hi: 1.002 r/Hj: 1.470
Particle ID 5230
-- cell ID 105629351935
-- calls grad: 25 calls transport: 24
--- G!inT: ID 12368 cellID 845034815486 | r/Hi: 1.001 r/Hj: 1.190
--- G!inT: ID 18406 cellID 845494990553 | r/Hi: 0.987 r/Hj: 1.457
--- T!inG: ID 1 cellID 107259138633 | r/Hi: 1.457 r/Hj: 0.838
Particle ID 6727
-- cell ID 107201616749
-- calls grad: 29 calls transport: 28
--- G!inT: ID 19290 cellID 857612933996 | r/Hi: 1.024 r/Hj: 1.271
Particle ID 11233
-- cell ID 845494990553
-- calls grad: 31 calls transport: 30
--- G!inT: ID 23430 cellID 845494990538 | r/Hi: 1.001 r/Hj: 1.103
Particle ID 14387
-- cell ID 999500246452
-- calls grad: 23 calls transport: 22
--- G!inT: ID 19459 cellID 999960421520 | r/Hi: 0.994 r/Hj: 1.462
Particle ID 14921
-- cell ID 999960421520
-- calls grad: 32 calls transport: 31
--- G!inT: ID 17348 cellID 999960421521 | r/Hi: 1.012 r/Hj: 1.335
Particle ID 15002
-- cell ID 107259138633
-- calls grad: 30 calls transport: 27
--- G!inT: ID 19155 cellID 107259138633 | r/Hi: 1.006 r/Hj: 1.129
--- G!inT: ID 14628 cellID 107259138633 | r/Hi: 1.006 r/Hj: 1.098
--- G!inT: ID 13003 cellID 107201616749 | r/Hi: 0.989 r/Hj: 1.316
Particle ID 16763
-- cell ID 1012078364965
-- calls grad: 26 calls transport: 25
--- G!inT: ID 15955 cellID 1012538540033 | r/Hi: 1.002 r/Hj: 1.290
--- G!inT: ID 5602 cellID 1012538540033 | r/Hi: 1.008 r/Hj: 1.298
--- T!inG: ID 2012 cellID 1012538540033 | r/Hi: 1.056 r/Hj: 0.962
Particle ID 21770
-- cell ID 857612933998
-- calls grad: 28 calls transport: 27
--- G!inT: ID 3142 cellID 857612933991 | r/Hi: 1.005 r/Hj: 1.125
--- G!inT: ID 23924 cellID 857612933991 | r/Hi: 0.989 r/Hj: 1.313
--- T!inG: ID 12946 cellID 858073109080 | r/Hi: 1.079 r/Hj: 0.966
Particle ID 22082
-- cell ID 1012538540033
-- calls grad: 34 calls transport: 33
--- G!inT: ID 4589 cellID 1012078364965 | r/Hi: 1.004 r/Hj: 1.458
Particle ID 23460
-- cell ID 1012078364967
-- calls grad: 41 calls transport: 38
--- G!inT: ID 6572 cellID 1012078364966 | r/Hi: 0.999 r/Hj: 1.074
--- G!inT: ID 15132 cellID 1012538540035 | r/Hi: 1.002 r/Hj: 1.307
--- G!inT: ID 21885 cellID 1012538540035 | r/Hi: 1.005 r/Hj: 1.225
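The comparison behind output like the above can be sketched with set differences. This is a minimal illustration, not the actual script: compare_calls is a hypothetical name, it assumes each loop logs the list of neighbour IDs every particle interacted with, and the IDs in the example are made up.

```python
def compare_calls(grad, transport):
    """Report particles with fewer transport calls than gradient calls.

    grad, transport: dict mapping particle ID -> list of neighbour IDs
    logged for that particle in the respective loop.
    Returns a dict: particle ID -> {"G!inT": [...], "T!inG": [...]}.
    """
    report = {}
    for pid in sorted(grad):
        g = grad[pid]
        t = transport.get(pid, [])
        if len(t) < len(g):
            report[pid] = {
                "G!inT": sorted(set(g) - set(t)),  # in gradient, not in transport
                "T!inG": sorted(set(t) - set(g)),  # in transport, not in gradient
            }
    return report

# Made-up toy logs: particle 1 loses neighbours 3 and 4 and gains 7,
# particle 5 only gains a neighbour and is therefore not reported.
grad = {1: [2, 3, 4], 5: [6]}
transport = {1: [2, 7], 5: [6, 8]}
print(compare_calls(grad, transport))
# prints {1: {'G!inT': [3, 4], 'T!inG': [7]}}
```

If each neighbour is logged at most once per loop, the transport count equals the gradient count minus the number of G!inT entries plus the number of T!inG entries, which matches the sample output (e.g. particle 5230: 25 - 2 + 1 = 24).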
A curious case is e.g.
Particle ID 15002
-- cell ID 107259138633
-- calls grad: 30 calls transport: 27
--- G!inT: ID 19155 cellID 107259138633 | r/Hi: 1.006 r/Hj: 1.129
--- G!inT: ID 14628 cellID 107259138633 | r/Hi: 1.006 r/Hj: 1.098
--- G!inT: ID 13003 cellID 107201616749 | r/Hi: 0.989 r/Hj: 1.316
so for some reason I get a
- self interaction (particles 15002, 19155, and 14628 are all in cell 107259138633) where r/Hi and r/Hj > 1.0, where Hi, Hj are the compact support radii of particles i, j (i is in this case always particle 15002)
- pair interaction with 13003 where r/Hi: 0.989, but this interaction is not happening in the transport step?
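To spell out why this case is odd, the sketch below applies the criteria I would expect to the logged ratios. The assumptions are that a gradient call for particle i requires r/Hi < 1, that a transport call requires r/Hi < 1 or r/Hj < 1, and that the logged ratios are the values at the time of the interaction. The function name expected_interactions is hypothetical.

```python
def expected_interactions(r_over_hi, r_over_hj):
    """Return (in_gradient, in_transport) for particle i under the assumed criteria.

    Assumptions of this sketch:
      gradient (non-symmetric): i is updated when r < Hi
      transport (symmetric):    the pair interacts when r < Hi or r < Hj
    """
    in_gradient = r_over_hi < 1.0
    in_transport = min(r_over_hi, r_over_hj) < 1.0
    return in_gradient, in_transport

# The three G!inT lines logged for particle 15002.
cases = [(19155, 1.006, 1.129), (14628, 1.006, 1.098), (13003, 0.989, 1.316)]
for other_id, ri, rj in cases:
    print(other_id, expected_interactions(ri, rj))
# prints:
# 19155 (False, False)
# 14628 (False, False)
# 13003 (True, True)
```

Under these assumptions the first two interactions should not have been counted in the gradient step at all, and the third should have been counted in both steps. The T!inG lines in the sample output (e.g. r/Hi: 1.457, r/Hj: 0.838) are what these criteria predict: transport only.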