Dopair1 vectorisation merge

There were just a couple of things I had questions about:

1. Shall I put the branching that calls the default DOPAIR1 for corner interactions in runner_main or in runner_dopair1_density_vec?
2. Shall I put the if(ci->isActive) checks around the two loops in runner_dopair1_density_vec, as in the serial version?
3. And finally, does the change I made to the fake particles in cache.h seem okay to you? (I want them to return false when I check the distance r2 < hig2.)
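For reference, the fake-particle idea in the last question can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in cache.h: the struct pad_cache, the helpers padded_count, cache_pad and dist2, and the vector width VEC_SIZE are all hypothetical names. The padding slots are filled with a position far from any real particle, so the test r2 < hig2 is false for every fake entry.

```c
#define VEC_SIZE 8 /* Assumed vector width in floats (hypothetical). */

/* Hypothetical, simplified cache holding positions only. The arrays must
 * have room for padded_count(count) entries. */
struct pad_cache {
  float *x;
  float *y;
  float *z;
  int count; /* Number of real particles read into the cache. */
};

/* Round the particle count up to the next multiple of the vector width. */
static int padded_count(int count) {
  return ((count + VEC_SIZE - 1) / VEC_SIZE) * VEC_SIZE;
}

/* Fill the padding slots with a far-away position so that any distance
 * check against a real particle fails. */
static void cache_pad(struct pad_cache *c, float far_away) {
  const int n = padded_count(c->count);
  for (int i = c->count; i < n; i++) {
    c->x[i] = far_away;
    c->y[i] = far_away;
    c->z[i] = far_away;
  }
}

/* Squared distance between cache entry i and the point (px, py, pz). */
static float dist2(const struct pad_cache *c, int i, float px, float py,
                   float pz) {
  const float dx = c->x[i] - px;
  const float dy = c->y[i] - py;
  const float dz = c->z[i] - pz;
  return dx * dx + dy * dy + dz * dz;
}
```

The value passed as far_away has to be large compared with the cell size and the smoothing lengths, but small enough that its square does not matter for the comparison; what value is appropriate depends on the units of the run.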

Added 1 commit:

05cfc439 - Removed last change as Sedov Blast 3D fails to run.

1. We probably want to have a switching function somewhere, as we also need to switch to the unsorted version over MPI. I'd have a single function that does this multi-way switch depending on MPI status, vectorization and direction.
2. You tell me whether it makes the whole thing faster or not.
3. What are you after: preventing active particles from being updated, or preventing inactive particles from updating active ones?

My questions:

What's the speed-up in the EAGLE_25 case ?

I realised we did not address (3) when chatting yesterday. Let me know what you are after and we can see whether we have other generic solutions.

Added 1 commit:

b8702709 - Created function to determine whether a pair of cells are corner facing by the sort ID.

Added 1 commit:

1b482dcc - Created a function that branches calls to DOPAIR1 so that correct version is cal…

1. I have created a function in runner_doiact.h that addresses this problem. See 1b482dcc. What do you think? In runner.c you then just call runner_dopair1_branch_density. The only problem is space_getsid: this function not only finds the sort ID, it may also swap the cells around. So the cells would get flipped twice if it is also called in DOPAIR1. I can either pass sid and shift to DOPAIR1 as arguments, or create another function that determines them without flipping the cells.
2. I have added the if statements to the code. It makes no difference in test27cells, as all particles are active, but it may have an effect on EAGLE_25, so I will run benchmarks with and without the change.
3. I basically want to prevent active particles from being updated by fake particles. The cache is padded, so this can happen if I go beyond the number of particles read into the cache.
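On the first point (branching on corner pairs without flipping the cells twice), one possible approach is sketched below. It is an assumption-laden illustration rather than the function added in 1b482dcc: it classifies the pair from the direction between the two cells instead of from the sort ID, and the names direction, pair_kind, classify_pair and use_vec_kernel are hypothetical. Because the classification only counts non-zero components, it gives the same answer for a direction and its negative, so it does not matter whether the cells have been swapped.

```c
/* Direction from ci to cj in units of cells; each component is -1, 0 or 1. */
struct direction {
  int dx;
  int dy;
  int dz;
};

enum pair_kind { PAIR_NONE, PAIR_FACE, PAIR_EDGE, PAIR_CORNER };

/* Classify a pair of neighbouring cells by how many axes they are offset
 * along: one axis is a shared face, two an edge, three a corner.
 * (0, 0, 0) is not a pair at all and yields PAIR_NONE. */
static enum pair_kind classify_pair(struct direction d) {
  const int nonzero = (d.dx != 0) + (d.dy != 0) + (d.dz != 0);
  switch (nonzero) {
    case 3:
      return PAIR_CORNER;
    case 2:
      return PAIR_EDGE;
    case 1:
      return PAIR_FACE;
    default:
      return PAIR_NONE;
  }
}

/* Returns 1 if the vectorised kernel should be used, 0 if the caller should
 * fall back to the default pair interaction (corner pairs in this sketch). */
static int use_vec_kernel(struct direction d) {
  const enum pair_kind k = classify_pair(d);
  return k == PAIR_FACE || k == PAIR_EDGE;
}
```

A branch function could evaluate use_vec_kernel once and then call either kernel, which keeps the decision in a single place as suggested earlier in the thread.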

Sorry, I didn't address your question. I will submit a job to find out how much faster the code is with runner_dopair1_density_vec. But I have quite a few jobs queued up in the swifttest queue, so it might not run until tomorrow.

Added 31 commits:

1b482dcc...4500d31f - 30 commits from branch master
cb0c811f - Merge branch 'master' into dopair-vectorisation-merge

I've run benchmarks for cubic-spline C4, but there was barely any speedup. I'm running again for Wendland C2. In the meantime I ran it through Vector Advisor and found that runner_dopair1_subset_density is being called. This is correct, isn't it? Can I replace this call with my runner_dopair1_density_vec?

No you can't. That's the version that only acts on a subset of particles when we have an incorrect value of "h".

But if this function appears high in the list, it might be worth looking at vectorizing it...

So my vectorisation work gives no speedup whatsoever...

[Figure: Cubic Spline]

[Figure: Wendland C2]

The only explanation I can think of is that the particles are inactive most of the time in EAGLE_25 and the overhead of reading the cache dominates. I could start looking at only reading the active particles, but then you still need the inactive particles for the pair interactions of the neighbouring cell.

Mmmhhh... ok... can you confirm your hypothesis by getting the VTune/Advisor take on the dopair1_density_vec function?

Or maybe by printing how many particles are active in ci and cj for each call to the function? That would allow you to assess whether you spend too much time in the overheads.

I found some interesting statistics in topcat. I printed the number of active particles in ci and cj for each call to runner_dopair1_density_vec. The mean number of active particles was 43, but 28% of the time there were no active particles present. This means that either I am missing some active particles in my loops or this line of the code is not working:

if (!cell_is_active(ci, e) && !cell_is_active(cj, e)) return;

I will do the same analysis on the master and check that I do not get zero active particles.

The mean number did not include corner interactions.

I am confused as to why this would happen, but I have never checked. In the situations where this takes place, could you check whether it's a straight call to DOPAIR1() or a call via a recursion in the sub-task?

As for the statistics, I think we need more than just the mean. In some time-steps all particles are active, and then the cache construction is a small overhead, but in many cases only a handful of particles are active and the cache construction is expensive. Could you make a full histogram?
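A histogram like the one requested could be accumulated directly in the code rather than from printed counts. The following is a minimal sketch, assuming hypothetical names (active_hist, hist_add, hist_print) and arbitrary bin settings; calls with zero active particles get their own bin, since that case is the one under discussion. It is not thread-safe, so each runner thread would need its own histogram.

```c
#include <stdio.h>

#define HIST_NBINS 16    /* Number of bins, including zero and overflow. */
#define HIST_BIN_WIDTH 8 /* Width of each non-zero bin. */

/* Hypothetical accumulator for the number of active particles per call. */
struct active_hist {
  long bins[HIST_NBINS];
  long calls;
};

/* Record one call. Bin 0 holds calls with no active particles, bin k >= 1
 * holds counts in [(k-1)*W + 1, k*W], and the last bin also collects
 * everything larger. Assumes n_active >= 0. */
static void hist_add(struct active_hist *h, int n_active) {
  int b = (n_active == 0) ? 0 : 1 + (n_active - 1) / HIST_BIN_WIDTH;
  if (b >= HIST_NBINS) b = HIST_NBINS - 1;
  h->bins[b]++;
  h->calls++;
}

/* Print one line per bin: bin index and number of calls that fell in it. */
static void hist_print(const struct active_hist *h) {
  for (int b = 0; b < HIST_NBINS; b++) {
    printf("%3d: %ld\n", b, h->bins[b]);
  }
}
```

Calling hist_add once for ci and once for cj into two separate accumulators would give the per-cell breakdown as well.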

Also, if a cell has no active particles within max_d from the centre, do you abort?

It might also be worth checking how many active particles you have in ci and in cj separately. I suspect that there are many cases where there are none in either of the two cells. In that situation it may be necessary to build the caches in a different way. Construct a cache of "sinks" from the active ci's and a cache of "sources" for the active particles in cj. Or vice-versa, depending on who is active. That will allow you to significantly reduce the time spent building the caches.
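The gathering step behind such a reduced cache might look like the sketch below. Everything here is hypothetical and simplified: toy_part stands in for the real particle struct, the active flag stands in for whatever activity test the engine provides, and active_cache and cache_gather_active are illustrative names. Only active particles are copied, and their original indices are kept so results can be written back.

```c
/* Hypothetical, simplified particle: position, smoothing length and an
 * activity flag (the real activity test is not reproduced here). */
struct toy_part {
  float x[3];
  float h;
  int active;
};

/* Hypothetical cache of active particles only. Each array must have room
 * for as many entries as there are particles in the cell. */
struct active_cache {
  float *x;
  float *y;
  float *z;
  float *h;
  int *index; /* Index of each cached particle in the cell's array. */
  int count;  /* Number of particles gathered. */
};

/* Copy only the active particles of a cell into the cache and return how
 * many were copied. A return value of 0 means there is nothing to update
 * on this side of the pair. */
static int cache_gather_active(const struct toy_part *parts, int n,
                               struct active_cache *c) {
  int k = 0;
  for (int i = 0; i < n; i++) {
    if (!parts[i].active) continue;
    c->x[k] = parts[i].x[0];
    c->y[k] = parts[i].x[1];
    c->z[k] = parts[i].x[2];
    c->h[k] = parts[i].h;
    c->index[k] = i;
    k++;
  }
  c->count = k;
  return k;
}
```

If the gather returns zero for both cells, the caller can skip building the other cache entirely.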

Also, for the speed-up remember that DOPAIR2 is a larger fraction of time than DOPAIR1 and that more interactions are computed in DOPAIR2 so vectorization will be beneficial there.

[Figure: Histogram for active particles in ci]

The majority of calls have very few active particles.

I do not check whether particles are active when constructing the cache; I should look at that. I am worried, though, about the 28% of calls that contain no active particles and yet have not already exited the function via:

if (!cell_is_active(ci, e) && !cell_is_active(cj, e)) return;

I will look at the cases where this happens and see if this happens in recursive calls.

"I suspect that there are many cases where there are none in either of the two cells."

I have stats for ci and cj. If there are no active particles in either ci or cj, I shouldn't be constructing the cache at all, because there's no work to do.

For the master, 31% of calls contained no active particles and the mean number of active particles was 39.
