Push the cooling task to a lower level to gain more parallelism
Push the cooling task to a lower level to gain more parallelism, for instance in the case of GRACKLE cooling.
This now contains just the changes to the cooling task.
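To make the idea concrete, here is a minimal sketch in C of what "pushing a task to a lower level" means, assuming an oct-tree of cells; the names (`cooling_make_tasks`, `attach_cooling_task`, `max_parts_per_cooling`) are hypothetical stand-ins, not SWIFT's actual scheduler API:

```c
#include <stddef.h>
#include <stdio.h>

struct cell {
  int count;               /* number of particles in this cell */
  int split;               /* does this cell have progeny? */
  struct cell *progeny[8]; /* children in the oct-tree */
};

/* Hypothetical stand-in for the scheduler call that creates one task. */
static void attach_cooling_task(struct cell *c) {
  printf("cooling task on a cell with %d particles\n", c->count);
}

/* Walk down the tree until cells are small enough, then attach one
 * cooling task per sub-cell instead of one per top-level cell. */
static void cooling_make_tasks(struct cell *c, int max_parts_per_cooling) {
  if (c->split && c->count > max_parts_per_cooling) {
    for (int k = 0; k < 8; k++)
      if (c->progeny[k] != NULL)
        cooling_make_tasks(c->progeny[k], max_parts_per_cooling);
  } else {
    attach_cooling_task(c);
  }
}
```

Splitting one task per top-level cell into many per-sub-cell tasks is what lets more threads work on the cooling at the same time.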
added 1 commit
- a73d9fce - Add the missing new tasks to the interactive task plotting script
The discussion started here: !1108 (comment 30096)
I have modified Grackle to run an expensive but useless loop just after entering the Fortran function, and then exit the function immediately.
```
# type/subtype : count   minimum   maximum       sum           mean       percent
# All threads:
  cooling/none :  1030    7.4340     1146.7485   164225.8926    159.4426    95.96
  cooling/none :    58    7.4375   114690.2099   164176.1506   2830.6233    14.22
```
As you can see, we get the same total time, so calling into the Fortran code is not the problem.
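For reference, the diagnostic stub amounts to something like the following (written here in C; the actual change was made on the Fortran side of Grackle, and `solve_chemistry_stub` / `n_iter` are illustrative names):

```c
#include <math.h>

/* Stand-in for the modified solver entry point: do expensive but useless
 * work right after entry, then return without touching any particle data.
 * If the timings stay the same, the cost of the cross-language call
 * itself is negligible. */
double solve_chemistry_stub(int n_iter) {
  volatile double sink = 0.0; /* volatile keeps the loop from being optimised away */
  for (int i = 0; i < n_iter; i++)
    sink += sin((double)i);   /* expensive, useless work */
  return sink;
}
```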
Sorry about the git mess. For my future testing reference: this gives a great speed-up with `engine_max_parts_per_kick = 1e6`, but after multiple steps it crashed with

```
[0001] [00634.3] runner_doiact_grav.c:runner_do_grav_down():71: cp->multipole not drifted.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
```
Note that the changes to the kick are not included here.
@lhausammann can you show me in the grackle code where your local functions lock objects in memory to ensure thread-safety?
Argh, thanks for the question. I was not looking at the correct function ><. I forgot to copy one of the variables. By the way, the structure contains a lot of pointers; maybe the problem comes from the data they point to.
I should be calling this one: https://gitlab.cosma.dur.ac.uk/swift/swiftsim/blob/master/src/cooling/grackle/cooling.c#L713
but I am actually calling this one: https://gitlab.cosma.dur.ac.uk/swift/swiftsim/blob/master/src/cooling/grackle/cooling.c#L677
That one then calls this function: https://github.com/grackle-project/grackle/blob/master/src/clib/solve_chemistry.c#L87
which in turn calls a Fortran function: https://github.com/grackle-project/grackle/blob/master/src/clib/solve_chemistry.c#L171
I am not an expert on Grackle, but I do not think they lock anything. They expect the user to run a single thread per MPI rank and then use OpenMP inside Grackle. Therefore, I think they should not have any locks.
When I copy the structures, some of the members are pointers, so the different threads end up accessing the same arrays (though they should not write into them). Do you think this shared access to the arrays could be the problem?
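To illustrate the situation (with made-up names, not Grackle's actual structs): a shallow copy of a struct duplicates pointer members by value, so each thread's copy still points at the same underlying arrays. Concurrent reads are safe; a write through any copy would be a data race:

```c
#include <string.h>

/* Made-up struct: the point is that a shallow copy duplicates the
 * pointer values, not the arrays they point to. */
struct chem_data {
  int n;
  const double *rates; /* shared lookup table */
};

/* Each thread gets its own copy of the struct, but not of the array. */
void thread_worker(const struct chem_data *global, double *out) {
  struct chem_data local;
  memcpy(&local, global, sizeof(local)); /* shallow copy: local.rates == global->rates */

  double sum = 0.0;
  for (int i = 0; i < local.n; i++)
    sum += local.rates[i]; /* concurrent read-only access: safe */
  /* Writing through local.rates from several threads would race. */
  *out = sum;
}
```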
I was using GCC; now I have tried with the Intel compiler:
```
# type/subtype      : count   minimum   maximum    sum        mean     percent
# All threads (new):
  cooling/none      :  1030    0.0090     1.3203   193.7010   0.1881   2.86
# All threads (old):
  cooling/none      :    58    0.0105   133.7666   192.3123   3.3157   2.72
```
As you can see, the problem seems to be linked to GCC, so VTune will not be very useful for tracking it down. The result from VTune is that Grackle spends most of its time in `__libc_malloc` (~43%).

Oh, I was sure that it only worked with ICC. Thanks for the information :)
I compiled both SWIFT and Grackle with ICC; previously both were compiled with GCC.
It seems that the default makefile in Grackle is a bit shitty. I have written my own, and now I get this with GCC:
```
# type/subtype      : count   minimum   maximum    sum        mean     percent
# All threads (new):
  cooling/none      :  1030    0.0094     1.4294   206.9717   0.2009   2.25
# All threads (old):
  cooling/none      :    58    0.0117   142.4641   205.2999   3.5397   2.24
```
added 1 commit
- d0ebee65 - Better default value for the cooling splitting for the default case where the…
assigned to @matthieu
mentioned in commit ac275e00