Hierarchical task recursion
Implement two changes:
- Push the cooling task to a deeper level
- Push the kick2/dt/kick1 tasks to a deeper level
To do:
- Properly deal with the time-step limiter
@lhausammann this implements the change to the cooling task we talked about. It should help with GEAR; it would be great if you could try it out. Note that the code will break if you use the time-step limiter or time-step sync. If you do run with these switched on, set Scheduler:engine_max_parts_per_kick to 10000000 or so.
@jkeger this implements what we talked about yesterday. It should push the kick and dt tasks to a deeper level and help speed things up.
Note to both: I have barely tested this.
@lhausammann you likely do not need the changes to the kicks anyway, so running with the parameter set as above will be fine for you. That means you can test it with the limiter and sync.
added 1 commit
- 9b3539f0 - Skip the new implicit tasks in the fake step
I tried at high, medium (8) and low redshift. In all cases, the new implementation is slower. My guess is that the default values are not good: the code is spending too much time recursing.
With an extremely high value of Scheduler:engine_max_parts_per_cooling, I get this task plot:
[task plot]
As you can see, there is now a lot less dead time. Currently, the main problem seems to be the dark green tasks (self/grav). If we forget about the green tasks, I think I should be able to obtain some kind of speedup with a bit of parameter tuning.
@lhausammann you seem to have a gravity problem, not cooling. Tweaking the scheduler params there will help. I don't understand why the new version spends more time overall in the cooling task.
@jkeger OK, so the value I picked might be too extreme and we spend too much time unskipping.
The thing that confuses me in Loic's plots is that we should have less work in each cooling task but many more of them. Is that actually true from the stats? Are there more of them, and is the average time lower? Note that the whole recursion should be hidden, as it happens in the scheduler and not in the tasks themselves anymore.
What value is "a large value"? And how many particles do you have here?
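For readers not familiar with the cell tree, here is a minimal sketch of the kind of recursion described above, assuming hypothetical names (struct cell, cell_add_cooling_task, make_cooling_tasks are illustrative only and do not reflect SWIFT's actual internals): the task is pushed down the cell hierarchy until the particle count drops below the Scheduler threshold.

#include <stddef.h>

struct cell {
  size_t count;            /* number of gas particles in this cell */
  int split;               /* has the cell been split into 8 progeny? */
  struct cell *progeny[8]; /* children, NULL where empty */
};

/* Illustrative stand-in for attaching a cooling task to a cell. */
void cell_add_cooling_task(struct cell *c) { (void)c; }

/* Attach the cooling task at the deepest level where the particle count
 * drops below the threshold; recurse through the progeny otherwise. */
void make_cooling_tasks(struct cell *c, size_t max_parts_per_cooling) {
  if (c == NULL || c->count == 0) return;
  if (c->count > max_parts_per_cooling && c->split) {
    for (int k = 0; k < 8; k++)
      make_cooling_tasks(c->progeny[k], max_parts_per_cooling);
  } else {
    cell_add_cooling_task(c);
  }
}

In this sketch, a huge threshold like the 10000000 suggested above makes the first condition fail immediately, so the task stays at the top level and the new recursion is effectively disabled.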
I was surprised by the gravity task; I just wanted to check if everything was fine with the ato
The only change between the two runs is the value of Scheduler:engine_max_parts_per_cooling (I am not recompiling).
# type/subtype : count minimum maximum sum mean percent
# All threads
# Low number of parts / cooling
cooling/none : 1030 0.0071 36.6447 4532.1986 4.4002 20.17
# high number of parts / cooling
cooling/none : 107 0.0073 156.3623 1125.2574 10.5164 6.34
I think the reason why the cooling looks better than last year is the task order. Before, I was doing it after the kick and had to wait for the end of the gravity, while here I am using the normal task order and can therefore do it at the same time as the gravity.
I know that grackle can use some multi-threading, but I did not compile it with that enabled. Yes, same system, but I will try to force the same node for the two computations.
added 1 commit
- 28b3f28b - Change file permission for task dependency plotting script
Ah excellent! So it does what it should on the SWIFT side. Good idea to use sleep() as a test!
But somehow, running too much grackle in parallel slows it down. Maybe that's the issue: grackle is limited by memory speed, or it spawns threads in the background. If you run in gdb, you will see whether any threads are created, for instance.
added 1 commit
- b525c566 - Move the time-step task back to the super level
When do you spawn threads in SWIFT?
I am running with 5 threads, but in total 17 threads are spawned:
....
INIT COOLING HERE
[00000.2] main: Reading ICs from file './h050_0776.hdf5'
[00000.2] io_read_unit_system: Reading IC units from ICs.
[00000.2] read_ic_single: IC and internal units match. No conversion needed.
[New Thread 0x2aaabbbbb700 (LWP 22994)]
[New Thread 0x2aaabbdbc700 (LWP 22995)]
[New Thread 0x2aaabbfbd700 (LWP 22996)]
[New Thread 0x2aaabc1be700 (LWP 22997)]
[Thread 0x2aaabc1be700 (LWP 22997) exited]
[Thread 0x2aaabbbbb700 (LWP 22994) exited]
[Thread 0x2aaabbfbd700 (LWP 22996) exited]
[Thread 0x2aaabbdbc700 (LWP 22995) exited]
[00000.9] main: Reading initial conditions took 671.391 ms.
[00000.9] main: Read 323031 gas particles, 1825 stars particles, 0 black hole particles, 324856 DM particles and 362888 DM background particles from the ICs.
[00001.0] main: space_init took 81.087 ms.
[00001.0] potential_print_backend: External potential is 'No external potential'.
....
[00001.0] engine_config: Absolute minimal timestep size: 1.800222e-18
[00001.0] engine_config: Minimal timestep size (on time-line): 9.958252e-17
[00001.0] engine_config: Maximal timestep size (on time-line): 5.605998e-02
[00001.0] engine_config: Restarts will be dumped every 4.000000 hours
[New Thread 0x2aaabc1be700 (LWP 23001)]
[New Thread 0x2aaabbfbd700 (LWP 23002)]
[New Thread 0x2aaabbdbc700 (LWP 23003)]
[New Thread 0x2aaabbbbb700 (LWP 23004)]
[New Thread 0x2aaac43cf700 (LWP 23005)]
[New Thread 0x2aaac45d0700 (LWP 23006)]
[New Thread 0x2aaac47d1700 (LWP 23007)]
[New Thread 0x2aaac49d2700 (LWP 23008)]
[New Thread 0x2aaac4bd3700 (LWP 23009)]
[00001.0] main: engine_init took 20.416 ms.
[00001.0] main: Running on 323031 gas particles, 1825 stars particles 0 black hole particles and 687744 DM particles (1012600 gravity particles)
[00001.0] main: from t=1.054e+01 until t=1.413e+01 with 1 ranks, 5 threads / rank and 5 task queues / rank (dt_min=1.000e-16, dt_max=1.000e-01)...
[00001.0] engine_init_particles: Setting particles to a valid state...
[00001.1] engine_init_particles: Computing initial gas densities and approximate gravity.
[New Thread 0x2aaabf154700 (LWP 23012)]
[New Thread 0x2aaabf355700 (LWP 23013)]
[New Thread 0x2aaabfe00700 (LWP 23014)]
[New Thread 0x2aaac0e00700 (LWP 23015)]
[00075.4] engine_init_particles: Converting internal energy variable.
[00075.4] engine_init_particles: Running initial fake time-step.
#   Step           Time Scale-factor     Redshift      Time-step Time-bins      Updates    g-Updates    s-Updates    b-Updates  Wall-clock time [ms]  Props
       0   1.054270e+01    0.7714840    0.2962032   0.000000e+00    1   56       323031      1012600         1825            0            118189.188      9
....
      21   1.054376e+01    0.7715481    0.2960954   5.024007e-05   40   40            2            2            0            0                11.086      0
[Thread 0x2aaac0e00700 (LWP 23015) exited]
[Thread 0x2aaabf355700 (LWP 23013) exited]
[Thread 0x2aaabfe00700 (LWP 23014) exited]
[Thread 0x2aaac45d0700 (LWP 23006) exited]
[Thread 0x2aaabf154700 (LWP 23012) exited]
[Thread 0x2aaac4bd3700 (LWP 23009) exited]
[Thread 0x2aaac49d2700 (LWP 23008) exited]
[Thread 0x2aaac43cf700 (LWP 23005) exited]
[Thread 0x2aaac47d1700 (LWP 23007) exited]
[Thread 0x2aaabc1be700 (LWP 23001) exited]
[Thread 0x2aaabbbbb700 (LWP 23004) exited]
[Thread 0x2aaabbfbd700 (LWP 23002) exited]
The SWIFT ones are all created in engine_init(). And the ones here at the start are just for reading and are then destroyed.
You should expect 4 threadpool threads (+1, which is the master thread) and 5 threads for the tasks. There might be some from FFTW, and those are maybe the ones you see after engine_init_particles.
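Counting the [New Thread ...] lines in the log above roughly matches that: 4 threads are created (and exit again) while reading the ICs, 9 more appear around engine_init (consistent with 5 task runners plus 4 threadpool threads), and a final 4 show up just before the initial density/gravity calculation, presumably the FFTW ones. That gives 4 + 9 + 4 = 17 spawned threads in total.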
@lhausammann is grackle using some sort of internal state that is shared among all the calling threads?
I created those "thread-safe" functions. The cooling was totally wrong without them, or crashing (I do not really remember); I only checked the results and not the speed.
In grackle, it seems that there are only two global variables. In SWIFT, every time I compute the cooling for a particle, I copy the structure of the global constants into local variables and then provide them to the thread-safe functions. Do you think it could be a problem?
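For concreteness, a rough sketch of that pattern, with made-up names on the SWIFT side (cooling_global_data, cooling_globals, cool_one_particle are purely illustrative, not real SWIFT or grackle symbols); grackle's thread-safe local_* functions, documented in the Reference page linked below, are only referenced in the comments:

struct part; /* SWIFT's particle type, left opaque here */

struct cooling_global_data {
  /* run-wide cooling constants: chemistry parameters, units, ... */
  double density_units, length_units, time_units;
};

/* Set once at start-up, then only read. */
struct cooling_global_data cooling_globals;

void cool_one_particle(struct part *p, double dt) {
  /* 1. Copy the global constants onto the stack, so each calling thread
   *    works on its own private copy. */
  struct cooling_global_data local = cooling_globals;

  /* 2. Build the per-particle field data from *p (omitted here). */

  /* 3. Pass &local, not the library's global state, to one of grackle's
   *    thread-safe local_* functions, together with the field data. */
  (void)p; (void)dt; (void)local;
}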
I tried running SWIFT on a single thread and without the threaded version of FFTW. Only a single thread was created. If grackle were creating threads, I would expect a large number of them, as it uses OpenMP.
Good test. Then it's not extra threads.
Can you run the same time-step as above with one single SWIFT thread, and then compare the sum of the time in the cooling function to the sum in the case where you are running 24 threads? The speed boost on 1 core will mess up the measurement a bit, but it's still interesting to see whether time is wasted in locks inside grackle.
You mean you created these on the grackle side: https://grackle.readthedocs.io/en/latest/Reference.html#local-functions
Yes I did create them.
# Task times:
# -----------
# type/subtype : count minimum maximum sum mean percent
# All threads (simulation with 1 thread):
cooling/none : 1030 0.0068 1.0083 146.0739 0.1418 2.24
# All threads (simulation with a lot of threads):
cooling/none : 1030 0.0078 36.4843 3655.2676 3.5488 16.38
Huge difference between the two :/
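Taking the numbers above at face value, the same 1030 cooling tasks sum to roughly 146 with one thread versus roughly 3655 with many threads, i.e. 3655.27 / 146.07 ≈ 25 times more total time spent inside cooling for identical work, which is consistent with the suspicion above that time is being lost to contention inside the grackle calls.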
mentioned in merge request !1110 (merged)