Dopair1 vectorisation merge
Summary
- Contains a vectorised version of `runner_dopair1_density`
- Runners now have two particle caches to hold cell ci and cell cj
- The cache has been updated to include `max_d`, which is the maximum distance into the neighbouring cell
- `cache.h` contains more functions to read two cells into the cache and to read a subset of particles from each cell into the cache
There were just a couple of things I had questions about:
- Shall I put the branching to call the default `DOPAIR1` for the corner interactions in `runner_main` or in `runner_dopair1_density_vec`?
- Shall I put the `if (ci->isActive)` checks around the two loops in `runner_dopair1_density_vec`, like in the serial version?
- And finally, does the change I made with the fake particles in `cache.h` seem okay to you? (I want them to return false when I check the distance r2 < hig2.)
Added 1 commit:
- 05cfc439 - Removed last change as Sedov Blast 3D fails to run.
- We probably want to have a switching function somewhere, as we also need to switch to the unsorted version over MPI. I'd have a single function that does this multi-way switch depending on MPI status, vectorisation and direction.
- You tell me whether it makes the whole thing faster or not.
- What are you after: preventing active particles from being updated, or preventing inactive particles from updating active ones?
My questions:
- What's the speed-up in the EAGLE_25 case?
Added 1 commit:
- b8702709 - Created function to determine whether a pair of cells are corner facing by the sort ID.
Added 1 commit:
- 1b482dcc - Created a function that branches calls to DOPAIR1 so that correct version is cal…
- I have created a function in `runner_doiact.h` that addresses this problem. See 1b482dcc. What do you think? In `runner.c` you then just call `runner_dopair1_branch_density`. The only problem is `space_getsid`: this function not only finds the sort ID, it might also swap the cells around, so the cells would get flipped twice if it is also called in `DOPAIR1`. I can either pass `sid` and `shift` to `DOPAIR1` as arguments or create another function that determines these without flipping the cells.
- I have added the `if` statements to the code and it makes no difference in `test27cells`, as all particles are active, but it may have an effect on EAGLE_25, so I will run benchmarks with and without the change.
- I basically want to prevent active particles from being updated by fake particles when the cache is padded, if I go beyond the bounds of the number of particles read into the cache.
Added 31 commits:
- 1b482dcc...4500d31f - 30 commits from branch `master`
- cb0c811f - Merge branch 'master' into dopair-vectorisation-merge
I've run benchmarks for cubic-spline C4, but there was barely any speedup. I'm running again for Wendland C2. In the meantime I ran it through Vector Advisor and found that `runner_dopair1_subset_density` is being called. This is correct, isn't it? Can I replace this call with my `runner_dopair1_density_vec`?

So my vectorisation work gives no speedup whatsoever...
[Benchmark plots attached: Cubic Spline and Wendland C2]
The only thing I can think is happening is that the particles are inactive most of the time in EAGLE_25 and the overhead of reading the cache dominates. I could start looking at only reading the active particles, but then you still need the inactive particles for the pair interactions with the neighbouring cell.
I found some interesting statistics in TOPCAT. I printed the number of active particles in ci and cj for each call to `runner_dopair1_density_vec`. The mean number of active particles was 43, but 28% of the time there were no active particles present. This means I am either missing some active particles when looping, or this line of the code is not working: `if (!cell_is_active(ci, e) && !cell_is_active(cj, e)) return;`
I will do the same analysis on `master` and check that I do not get zero active particles.

Edited by James Willis

I am confused as to why this would happen, but I have never checked. In situations where this takes place, could you check whether it's a straight call to DOPAIR1() or a call via a recursion in the sub-task?
As for the statistics, I think we need more than just the mean. In some time-steps all particles are active and then the cache construction is a small overhead, but in many cases only a handful of particles are active and the cache construction is expensive. Could you make a full histogram?
Also, if a cell has no active particles within `max_d` from the centre, do you abort?
It might also be worth checking how many active particles you have in ci and in cj separately. I suspect that there are many cases where there are none in either of the two cells. In that situation it may be necessary to build the caches in a different way: construct a cache of "sinks" from the active particles in ci and a cache of "sources" for the active particles in cj, or vice-versa depending on which cell is active. That will allow you to significantly reduce the time spent building the caches.
Also, for the speed-up remember that DOPAIR2 is a larger fraction of time than DOPAIR1 and that more interactions are computed in DOPAIR2 so vectorization will be beneficial there.
Histogram for active particles in ci
The majority of calls have very few active particles.
I do not check whether particles are active when constructing the cache; I should look at that. Although I am still worried about the 28% of cases that contain no active particles where the function has not already been exited with:
`if (!cell_is_active(ci, e) && !cell_is_active(cj, e)) return;`
I will look at the cases where this happens and see if this happens in recursive calls.
"I suspect that there are many cases where there are none in either of the two cells."
I have stats for ci and cj; if there are no active particles in either ci or cj, I shouldn't be constructing the cache at all, because there's no work to do.