Hierarchical task recursion
Implement two changes:
- Push the cooling task to a deeper level
- Push the kick2/dt/kick1 tasks to a deeper level
To do:
- Properly deal with the time-step limiter
@lhausammann this implements the change to the cooling task we talked about. It should help with GEAR; it would be great if you could try it out. Note that the code will break if you use the time-step limiter or time-step sync. If you do run with these switched on, set Scheduler:engine_max_parts_per_kick to 10000000 or so.
@jkeger this implements what we talked about yesterday. It should push the kick and dt tasks to a deeper level and help speed things up.
Note to both: I have barely tested this.
@lhausammann you likely do not need the changes to the kicks anyway, so running with the parameter set as above will be fine for you. That means you can test it with the limiter and sync.
added 1 commit
- 9b3539f0 - Skip the new implicit tasks in the fake step
I tried at high, medium (8) and low redshift. In all cases, the new implementation is slower. My guess is that the default values are not good: the code is spending too much time recursing.
With an extremely high value of Scheduler:engine_max_parts_per_cooling, I get this task plot:
[task plot]
As you can see, there is now a lot less dead time. Currently, the main problem seems to be the dark green tasks (self/grav). If we forget about the green tasks, I think I should be able to obtain some kind of speedup with a bit of parameter tuning.
@lhausammann you seem to have a gravity problem, not cooling. Tweaking the scheduler params there will help. I don't understand why the new version spends more time overall in the cooling task.
@jkeger OK, so the value I picked might be too extreme and we spend too much time unskipping.
The thing that confuses me in Loic's plots is that we should have less work in each cooling task but many more of them. Is that actually true from the stats? Are there more of them, and is the average time lower? Note that the whole recursion should be hidden, as it happens in the scheduler and not in the tasks themselves anymore.
What value is "a large value"? And how many particles do you have here?
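For readers not familiar with the cell tree, here is a minimal sketch of the kind of recursion described above, assuming hypothetical names (struct cell, cell_add_cooling_task, make_cooling_tasks are illustrative only and do not reflect SWIFT's actual internals): the task is pushed down the cell hierarchy until the particle count drops below the Scheduler threshold.

#include <stddef.h>

struct cell {
  size_t count;            /* number of gas particles in this cell */
  int split;               /* has the cell been split into 8 progeny? */
  struct cell *progeny[8]; /* children, NULL where empty */
};

/* Illustrative stand-in for attaching a cooling task to a cell. */
void cell_add_cooling_task(struct cell *c) { (void)c; }

/* Attach the cooling task at the deepest level where the particle count
 * drops below the threshold; recurse through the progeny otherwise. */
void make_cooling_tasks(struct cell *c, size_t max_parts_per_cooling) {
  if (c == NULL || c->count == 0) return;
  if (c->count > max_parts_per_cooling && c->split) {
    for (int k = 0; k < 8; k++)
      make_cooling_tasks(c->progeny[k], max_parts_per_cooling);
  } else {
    cell_add_cooling_task(c);
  }
}

In this sketch, a huge threshold like the 10000000 suggested above makes the first condition fail immediately, so the task stays at the top level and the new recursion is effectively disabled.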
I was surprised by the gravity task; I just wanted to check if everything was fine with the ato
The only change between the two runs is the value of Scheduler:engine_max_parts_per_cooling (I am not recompiling).
# type/subtype : count minimum maximum sum mean percent
# All threads
# Low number of parts / cooling
cooling/none : 1030 0.0071 36.6447 4532.1986 4.4002 20.17
# high number of parts / cooling
cooling/none : 107 0.0073 156.3623 1125.2574 10.5164 6.34
I think the reason why the cooling looks better than last year is the task order. Before, I was doing it after the kick and had to wait for the end of the gravity, while here I am using the normal task order and can therefore do it at the same time as the gravity.
I know that grackle can use some multi-threading, but I did not compile it with that enabled. Yes, same system, but I will try to force the same node for the two computations.
added 1 commit
- 28b3f28b - Change file permission for task dependency plotting script
Ah excellent! So it does what it should on the SWIFT side. Good idea to use sleep() as a test!
But somehow, running too much grackle in parallel slows it down. Maybe that's the issue: grackle is limited by memory speed, or it spawns threads in the background. If you run in gdb, you will see whether any threads are created, for instance.
added 1 commit
- b525c566 - Move the time-step task back to the super level
When do you spawn threads in SWIFT?
I am running with 5 threads, but in total 17 threads are spawned:
....
INIT COOLING HERE
[00000.2] main: Reading ICs from file './h050_0776.hdf5'
[00000.2] io_read_unit_system: Reading IC units from ICs.
[00000.2] read_ic_single: IC and internal units match. No conversion needed.
[New Thread 0x2aaabbbbb700 (LWP 22994)]
[New Thread 0x2aaabbdbc700 (LWP 22995)]
[New Thread 0x2aaabbfbd700 (LWP 22996)]
[New Thread 0x2aaabc1be700 (LWP 22997)]
[Thread 0x2aaabc1be700 (LWP 22997) exited]
[Thread 0x2aaabbbbb700 (LWP 22994) exited]
[Thread 0x2aaabbfbd700 (LWP 22996) exited]
[Thread 0x2aaabbdbc700 (LWP 22995) exited]
[00000.9] main: Reading initial conditions took 671.391 ms.
[00000.9] main: Read 323031 gas particles, 1825 stars particles, 0 black hole particles, 324856 DM particles and 362888 DM background particles from the ICs.
[00001.0] main: space_init took 81.087 ms.
[00001.0] potential_print_backend: External potential is 'No external potential'.
....
[00001.0] engine_config: Absolute minimal timestep size: 1.800222e-18
[00001.0] engine_config: Minimal timestep size (on time-line): 9.958252e-17
[00001.0] engine_config: Maximal timestep size (on time-line): 5.605998e-02
[00001.0] engine_config: Restarts will be dumped every 4.000000 hours
[New Thread 0x2aaabc1be700 (LWP 23001)]
[New Thread 0x2aaabbfbd700 (LWP 23002)]
[New Thread 0x2aaabbdbc700 (LWP 23003)]
[New Thread 0x2aaabbbbb700 (LWP 23004)]
[New Thread 0x2aaac43cf700 (LWP 23005)]
[New Thread 0x2aaac45d0700 (LWP 23006)]
[New Thread 0x2aaac47d1700 (LWP 23007)]
[New Thread 0x2aaac49d2700 (LWP 23008)]
[New Thread 0x2aaac4bd3700 (LWP 23009)]
[00001.0] main: engine_init took 20.416 ms.
[00001.0] main: Running on 323031 gas particles, 1825 stars particles 0 black hole particles and 687744 DM particles (1012600 gravity particles)
[00001.0] main: from t=1.054e+01 until t=1.413e+01 with 1 ranks, 5 threads / rank and 5 task queues / rank (dt_min=1.000e-16, dt_max=1.000e-01)...
[00001.0] engine_init_particles: Setting particles to a valid state...
[00001.1] engine_init_particles: Computing initial gas densities and approximate gravity.
[New Thread 0x2aaabf154700 (LWP 23012)]
[New Thread 0x2aaabf355700 (LWP 23013)]
[New Thread 0x2aaabfe00700 (LWP 23014)]
[New Thread 0x2aaac0e00700 (LWP 23015)]
[00075.4] engine_init_particles: Converting internal energy variable.
[00075.4] engine_init_particles: Running initial fake time-step.
#   Step           Time Scale-factor     Redshift      Time-step Time-bins      Updates    g-Updates    s-Updates    b-Updates  Wall-clock time [ms]  Props
       0   1.054270e+01    0.7714840    0.2962032   0.000000e+00    1   56       323031      1012600         1825            0            118189.188      9
....
      21   1.054376e+01    0.7715481    0.2960954   5.024007e-05   40   40            2            2            0            0                11.086      0
[Thread 0x2aaac0e00700 (LWP 23015) exited]
[Thread 0x2aaabf355700 (LWP 23013) exited]
[Thread 0x2aaabfe00700 (LWP 23014) exited]
[Thread 0x2aaac45d0700 (LWP 23006) exited]
[Thread 0x2aaabf154700 (LWP 23012) exited]
[Thread 0x2aaac4bd3700 (LWP 23009) exited]
[Thread 0x2aaac49d2700 (LWP 23008) exited]
[Thread 0x2aaac43cf700 (LWP 23005) exited]
[Thread 0x2aaac47d1700 (LWP 23007) exited]
[Thread 0x2aaabc1be700 (LWP 23001) exited]
[Thread 0x2aaabbbbb700 (LWP 23004) exited]
[Thread 0x2aaabbfbd700 (LWP 23002) exited]
The SWIFT ones are all created in engine_init(). And the ones here at the start are just for reading and are then destroyed.
You should expect 4 threadpool threads (+1, which is the master thread) and 5 threads for the tasks. There might be some from FFTW, and those are maybe the ones you see after engine_init_particles.
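Counting the [New Thread ...] lines in the log above roughly matches that: 4 threads are created (and exit again) while reading the ICs, 9 more appear around engine_init (consistent with 5 task runners plus 4 threadpool threads), and a final 4 show up just before the initial density/gravity calculation, presumably the FFTW ones. That gives 4 + 9 + 4 = 17 spawned threads in total.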
@lhausammann is grackle using some sort of internal state that is shared among all the calling threads?
I created those "thread-safe" functions. The cooling was totally wrong without them, or crashing (I do not really remember); I only checked the results and not the speed.
In grackle, it seems that there are only two global variables. In SWIFT, every time I compute the cooling for a particle, I copy the structure of the global constants into local variables and then provide them to the thread-safe functions. Do you think it could be a problem?
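For concreteness, a rough sketch of that pattern, with made-up names on the SWIFT side (cooling_global_data, cooling_globals, cool_one_particle are purely illustrative, not real SWIFT or grackle symbols); grackle's thread-safe local_* functions, documented in the Reference page linked below, are only referenced in the comments:

struct part; /* SWIFT's particle type, left opaque here */

struct cooling_global_data {
  /* run-wide cooling constants: chemistry parameters, units, ... */
  double density_units, length_units, time_units;
};

/* Set once at start-up, then only read. */
struct cooling_global_data cooling_globals;

void cool_one_particle(struct part *p, double dt) {
  /* 1. Copy the global constants onto the stack, so each calling thread
   *    works on its own private copy. */
  struct cooling_global_data local = cooling_globals;

  /* 2. Build the per-particle field data from *p (omitted here). */

  /* 3. Pass &local, not the library's global state, to one of grackle's
   *    thread-safe local_* functions, together with the field data. */
  (void)p; (void)dt; (void)local;
}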
I tried running SWIFT on a single thread and without the threaded version of FFTW. Only a single thread was created. If grackle were creating threads, I would expect a large number of them, as it uses OpenMP.
Good test. Then it's not extra threads.
Can you run the same time-step as above with one single SWIFT thread, and then compare the sum of the time in the cooling function to the sum in the case where you are running 24 threads? The speed boost on 1 core will mess up the measurement a bit, but it's still interesting to see whether time is wasted in locks inside grackle.
You mean you created these on the grackle side: https://grackle.readthedocs.io/en/latest/Reference.html#local-functions
Yes I did create them.
# Task times:
# -----------
# type/subtype : count minimum maximum sum mean percent
# All threads (simulation with 1 thread):
cooling/none : 1030 0.0068 1.0083 146.0739 0.1418 2.24
# All threads (simulation with a lot of threads):
cooling/none : 1030 0.0078 36.4843 3655.2676 3.5488 16.38
Huge difference between the two :/
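Taking the numbers above at face value, the same 1030 cooling tasks sum to roughly 146 with one thread versus roughly 3655 with many threads, i.e. 3655.27 / 146.07 ≈ 25 times more total time spent inside cooling for identical work, which is consistent with the suspicion above that time is being lost to contention inside the grackle calls.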
mentioned in merge request !1110 (merged)