[WIP] Re-entrant Threadpool Mapper
Quick hack of a re-entrant mapper for the threadpool, i.e. a mapper to which additional data can be added on the fly.
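For readers following along, here is a minimal sketch of what "re-entrant" means in this context: the map function itself may push more work onto the pool while the map is still running. This is not the code on this branch. All names (rmap_add, rmap_run, rmap_worker, RMAP_CAPACITY, the visit callback) are hypothetical, a mutex-protected stack stands in for whatever the real implementation uses, and idle workers simply yield and retry.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define RMAP_CAPACITY 1024 /* hypothetical fixed stack size for this sketch */
#define RMAP_MAX_THREADS 16

struct rmap;
typedef void (*rmap_fn)(struct rmap *rm, int item, void *extra);

struct rmap {
  pthread_mutex_t lock;
  int stack[RMAP_CAPACITY]; /* items waiting to be mapped */
  int count;                /* number of items on the stack */
  int pending;              /* items added but not yet fully processed */
  rmap_fn fun;
  void *extra;
};

/* Add an item. Safe to call from inside the map function (re-entrant use). */
static int rmap_add(struct rmap *rm, int item) {
  int ok = 0;
  pthread_mutex_lock(&rm->lock);
  if (rm->count < RMAP_CAPACITY) {
    rm->stack[rm->count++] = item;
    rm->pending++;
    ok = 1;
  }
  pthread_mutex_unlock(&rm->lock);
  return ok;
}

/* Worker loop: pop and process items until nothing is queued or in flight. */
static void *rmap_worker(void *arg) {
  struct rmap *rm = arg;
  for (;;) {
    int item = 0, have = 0, done = 0;
    pthread_mutex_lock(&rm->lock);
    if (rm->count > 0) {
      item = rm->stack[--rm->count];
      have = 1;
    } else if (rm->pending == 0) {
      done = 1; /* empty stack and no item still running: nothing can be added */
    }
    pthread_mutex_unlock(&rm->lock);
    if (done) return NULL;
    if (!have) { /* another thread may still add work: yield and retry */
      sched_yield();
      continue;
    }
    rm->fun(rm, item, rm->extra);
    pthread_mutex_lock(&rm->lock);
    rm->pending--; /* only now is the item, including its additions, finished */
    pthread_mutex_unlock(&rm->lock);
  }
}

/* Run the map on nr_threads threads; the calling thread is one of them. */
static void rmap_run(struct rmap *rm, int nr_threads) {
  pthread_t threads[RMAP_MAX_THREADS];
  int started = 0;
  if (nr_threads > RMAP_MAX_THREADS) nr_threads = RMAP_MAX_THREADS;
  for (int i = 0; i < nr_threads - 1; i++)
    if (pthread_create(&threads[started], NULL, rmap_worker, rm) == 0) started++;
  rmap_worker(rm);
  for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
}

/* Example map function: count the visit and add two children up to depth 5. */
static void visit(struct rmap *rm, int depth, void *extra) {
  atomic_fetch_add((atomic_int *)extra, 1);
  if (depth < 5) {
    int ok = rmap_add(rm, depth + 1) && rmap_add(rm, depth + 1);
    assert(ok);
    (void)ok;
  }
}

/* Seed the pool with one root item and return the number of visited items. */
static int rmap_demo(int nr_threads) {
  atomic_int visited;
  struct rmap rm = {.count = 0, .pending = 0, .fun = visit};
  atomic_init(&visited, 0);
  rm.extra = &visited;
  pthread_mutex_init(&rm.lock, NULL);
  rmap_add(&rm, 0);
  rmap_run(&rm, nr_threads);
  pthread_mutex_destroy(&rm.lock);
  return atomic_load(&visited);
}
```

Calling rmap_demo(4) seeds the pool with a single root item and lets the map function grow a full binary tree of depth 5, i.e. 63 items, without the caller knowing the total in advance. The pending counter is what makes termination safe: a worker may only leave once the stack is empty and no item is still being processed.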
@matthieu, can you have a look at this?
Currently I've only converted the runner_do_unskip mapper to this mode of operation. Could you possibly run the EAGLE_25 benchmark and plot the threadpool tasks for the first few steps? What I'm most interested in is the behaviour of the smallest steps, i.e. the threadpool tasks should now parallelize much better.
Cheers!
@jwillis, could you run this on your special node against the latest master when you have some time? Thanks!
I'll make some threadpool task plots in parallel.
@jwillis, yes, it's the only thing I have... Any idea what it's hanging on?
I have run EAGLE_12 with GCC on 4 threads, and with -v 1
I get this far:
 Welcome to the cosmological hydrodynamical code
    ______       _________________
   / ___/ |     / /  _/ ___/_  __/
   \__ \| | /| / // // /_   / /
  ___/ /| |/ |/ // // __/  / /
 /____/ |__/|__/___/_/    /_/
 SPH With Inter-dependent Fine-grained Tasking
 Version : 0.6.0
 Revision: v0.6.0-209-g286fc4ff, Branch: threadpool_rmapper, Date: 2017-08-29 22:36:25 +0200
 Webpage : www.swiftsim.com
 Config. options: '--disable-doxygen-doc --disable-mpi --enable-debugging-checks'
 Compiler: GCC, Version: 4.8.1
 CFLAGS : '-O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -funroll-loops -march=corei7-avx -mavx -pthread -Wall -Wextra -Wno-unused-parameter -Werror'
 HDF5 library version: 1.8.9
 FFTW library version: 3.x (details not available)
[00000.0] main: CPU frequency used for tick conversion: 2600000000 Hz
[00000.0] main: Running on: m5019
[00000.0] main: WARNING: Debugging checks activated. Code will be slower !
[00000.0] main: sizeof(part) is 160 bytes.
[00000.0] main: sizeof(xpart) is 64 bytes.
[00000.0] main: sizeof(spart) is 96 bytes.
[00000.0] main: sizeof(gpart) is 128 bytes.
[00000.0] main: sizeof(multipole) is 160 bytes.
[00000.0] main: sizeof(grav_tensor) is 288 bytes.
[00000.0] main: sizeof(task) is 64 bytes.
[00000.0] main: sizeof(cell) is 768 bytes.
[00000.0] main: Reading runtime parameters from file 'eagle_12.yml'
[00000.0] main: Internal unit system: U_M = 1.989000e+43 g.
[00000.0] main: Internal unit system: U_L = 3.085678e+24 cm.
[00000.0] main: Internal unit system: U_t = 3.085678e+19 s.
[00000.0] main: Internal unit system: U_I = 1.000000e+00 A.
[00000.0] main: Internal unit system: U_T = 1.000000e+00 K.
[00000.0] phys_const_print: Gravitational constant = 4.302051e+01
[00000.0] phys_const_print: Speed of light = 2.997925e+05
[00000.0] phys_const_print: Planck constant = 9.787529e-02
[00000.0] phys_const_print: Boltzmann constant = 6.941420e-70
[00000.0] phys_const_print: Thomson cross-section = 6.986843e-74
[00000.0] phys_const_print: Electron-Volt = 8.055187e-66
[00000.0] phys_const_print: Year = 1.022690e-12
[00000.0] phys_const_print: Astronomical Unit = 4.848136e-12
[00000.0] phys_const_print: Parsec = 9.999999e-07
[00000.0] phys_const_print: Solar mass = 9.997486e-11
[00000.0] main: Reading ICs from file './EAGLE_ICs_12.hdf5'
[00000.0] read_ic_single: IC and internal units match. No conversion needed.
[00004.6] read_ic_single: Particle Type 5 not yet supported. Particles ignored
[00004.6] main: Reading initial conditions took 4599.469 ms.
[00004.6] main: Read 6387423 gas particles, 0 star particles and 0 gparts from the ICs.
[00004.6] space_init: max_size set to 8000000, sub_size_pair set to 256000000, sub_size_self set to 32000, split_size set to 400
[00005.1] space_regrid: h_max is 3.480e-01 (cell_min=6.989e-01).
[00005.1] space_regrid: (re)griding space cdim=(12 12 12)
[00005.1] space_regrid: set cell dimensions to [ 12 12 12 ].
[00005.1] space_regrid: took 60.905 ms.
[00005.1] main: space_init took 461.842 ms.
[00005.1] main: space dimensions are [ 8.471 8.471 8.471 ].
[00005.1] main: space is periodic.
[00005.1] main: highest-level cell dimensions are [ 12 12 12 ].
[00005.1] main: 6387423 parts in 1728 cells.
[00005.1] main: 0 gparts in 1728 cells.
[00005.1] main: 0 sparts in 1728 cells.
[00005.1] main: maximum depth is 0.
[00005.1] main: map_cellcheck picked up 0 parts.
[00005.1] main: nr of cells at depth 0 is 1728.
[00005.1] engine_init: Affinity at entry: 11111111111111111111111111111111
[00005.1] engine_init: prefer NUMA-distant CPUs
[00005.1] engine_init: cpu map is [ 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15 16 24 17 25 18 26 19 27 20 28 21 29 22 30 23 31 ].
[00005.1] engine_policy: engine policies are [ steal keep numa_affinity hydro ]
[00005.1] hydro_props_print: Equation of state: Ideal gas.
[00005.1] hydro_props_print: Adiabatic index gamma: 1.666667.
[00005.1] hydro_props_print: Hydrodynamic scheme: Gadget-2 version of SPH (Springel 2005) in 3D.
[00005.1] hydro_props_print: Hydrodynamic kernel: Cubic spline (M4) with eta=1.234800 (48.00 neighbours).
[00005.1] hydro_props_print: Hydrodynamic relative tolerance in h: 0.00010 (+/- 0.0144 neighbours).
[00005.1] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.1000.
[00005.1] hydro_props_print: Hydrodynamic integration: Max change of volume: 1.40 (max|dlog(h)/dt|=0.112157).
[00005.1] engine_init: Absolute minimal timestep size: 6.938894e-20
[00005.1] engine_init: Minimal timestep size (on time-line): 7.450580e-11
[00005.1] engine_init: Maximal timestep size (on time-line): 7.812500e-05
[00005.1] engine_compute_next_snapshot_time: Next output time set to t=1.000000e-03.
[00005.1] engine_estimate_nr_tasks: tasks per cell estimated as: 36, maximum tasks: 62208
[00005.1] engine_init: runner 0 on cpuid=0 with qid=0.
[00005.1] engine_init: runner 1 on cpuid=8 with qid=1.
[00005.1] engine_init: runner 2 on cpuid=1 with qid=2.
[00005.1] engine_init: runner 3 on cpuid=9 with qid=3.
[00005.1] main: engine_init took 5.576 ms.
[00005.1] main: Running on 6387423 gas particles, 0 star particles and 0 DM particles (0 gravity particles)
[00005.1] main: from t=0.000e+00 until t=1.000e-02 with 4 threads and 4 queues (dt_min=1.000e-10, dt_max=1.000e-04)...
[00005.1] engine_init_particles: Computing initial gas densities.
[00005.1] space_rebuild: (re)building space
[00005.1] space_regrid: h_max is 3.480e-01 (cell_min=6.989e-01).
[00005.1] space_regrid: took 0.127 ms.
[00005.1] space_parts_get_cell_index: took 53.611 ms.
[00005.6] space_parts_sort: Sorting succeeded.
[00005.6] space_parts_sort: took 499.139 ms.
[00006.1] space_split: took 330.522 ms.
[00006.1] space_rebuild: took 1024.499 ms.
[00006.1] engine_estimate_nr_tasks: tasks per cell estimated as: 3, maximum tasks: 196938
[00006.2] scheduler_reweight: took 3.295 ms.
[00006.2] engine_maketasks: took 114.094 ms (including reweight).
[00006.2] engine_marktasks: took 23.221 ms.
[00006.3] engine_rebuild: took 1252.517 ms.
[00006.3] engine_print_task_counts: Total = 91061 (per cell = 2)
[00006.3] engine_print_task_counts: task counts are [ none=0 sort=1735 self=315 pair=12345 sub_self=1637 sub_pair=13433 init_grav=0 ghost=26926 extra_ghost=0 drift_part=0 drift_gpart=0 kick1=0 kick2=0 timestep=0 send=0 recv=0 grav_top_level=0 grav_long_range=0 grav_ghost=0 grav_mm=0 grav_down=0 cooling=0 sourceterms=0 skipped=34670 ]
[00006.3] engine_print_task_counts: nr_parts = 6387423.
[00006.3] engine_print_task_counts: nr_gparts = 0.
[00006.3] engine_print_task_counts: nr_sparts = 0.
[00006.3] engine_print_task_counts: took 2.237 ms.
[00018.1] engine_launch: took 11590.850 ms.
[00018.1] engine_init_particles: Converting internal energy variable.
[00018.4] engine_init_particles: Running initial fake time-step.
[00018.5] engine_marktasks: took 18.382 ms.
[00018.6] engine_print_task_counts: Total = 91061 (per cell = 2)
[00018.6] engine_print_task_counts: task counts are [ none=0 sort=0 self=630 pair=24690 sub_self=3274 sub_pair=26866 init_grav=0 ghost=26926 extra_ghost=0 drift_part=0 drift_gpart=0 kick1=1735 kick2=1735 timestep=1735 send=0 recv=0 grav_top_level=0 grav_long_range=0 grav_ghost=0 grav_mm=0 grav_down=0 cooling=0 sourceterms=0 skipped=3470 ]
[00018.6] engine_print_task_counts: nr_parts = 6387423.
[00018.6] engine_print_task_counts: nr_gparts = 0.
[00018.6] engine_print_task_counts: nr_sparts = 0.
[00018.6] engine_print_task_counts: took 2.896 ms.
[00028.8] engine_launch: took 10131.473 ms.
[00028.8] engine_collect_timestep_and_rebuild: took 0.143 ms.
[00029.0] part_verify_links: All links OK
[00029.0] engine_init_particles: took 23812.652 ms.
[00029.1] engine_dump_snapshot: writing snapshot at t=0.000000e+00.
[00029.1] write_output_single: Snapshot and internal units match. No conversion needed.
[00031.2] engine_dump_snapshot: writing particle properties took 2161.787 ms.
# Step Time Time-step Updates g-Updates s-Updates Wall-clock time [ms]
0 0.000000e+00 0.000000e+00 6387423 0 0 23812.652
[00031.3] space_rebuild: (re)building space
So somewhere in space_rebuild? If that helps.
Had another long, hard look and tried a few things, but even on my laptop with four cores, EAGLE_12 is a bit faster with this branch... Since the threads spin while waiting for work, using more threads than physical cores will make it slower (working threads compete with spinning threads), so I didn't pursue that further.
@jwillis, can you comment out the if (e->verbose) at the bottom of engine_unskip and try both the commented-out threadpool_map code, as well as the new threadpool_rmap code below it? If there's a large difference between the two, could you also run both master and this branch in VTune to see where all the time is going?
Cheers!
[00004.5] engine_policy: engine policies are [ steal keep numa_affinity hydro ]
[00004.5] hydro_props_print: Equation of state: Ideal gas.
[00004.5] hydro_props_print: Adiabatic index gamma: 1.666667.
[00004.5] hydro_props_print: Hydrodynamic scheme: Gadget-2 version of SPH (Springel 2005) in 3D.
[00004.5] hydro_props_print: Hydrodynamic kernel: Cubic spline (M4) with eta=1.234800 (48.00 neighbours).
[00004.5] hydro_props_print: Hydrodynamic relative tolerance in h: 0.00010 (+/- 0.0144 neighbours).
[00004.5] hydro_props_print: Hydrodynamic integration: CFL parameter: 0.1000.
[00004.5] hydro_props_print: Hydrodynamic integration: Max change of volume: 1.40 (max|dlog(h)/dt|=0.112157).
[00004.5] engine_init: Absolute minimal timestep size: 6.938894e-20
[00004.5] engine_init: Minimal timestep size (on time-line): 7.450580e-11
[00004.5] engine_init: Maximal timestep size (on time-line): 7.812500e-05
[New Thread 0x7fffad430700 (LWP 53229)]
[New Thread 0x7fffaca2f700 (LWP 53230)]
[New Thread 0x7fffac02e700 (LWP 53231)]
[New Thread 0x7fffab62d700 (LWP 53232)]
[New Thread 0x7fffaac2c700 (LWP 53233)]
[New Thread 0x7fffaa22b700 (LWP 53234)]
[New Thread 0x7fffa982a700 (LWP 53235)]
[00004.6] main: engine_init took 16.951 ms.
[00004.6] main: Running on 6387423 gas particles, 0 star particles and 0 DM particles (0 gravity particles)
[00004.6] main: from t=0.000e+00 until t=1.000e-02 with 4 threads and 4 queues (dt_min=1.000e-10, dt_max=1.000e-04)...
[00004.6] engine_init_particles: Computing initial gas densities.
[00017.1] engine_init_particles: Converting internal energy variable.
[00017.5] engine_init_particles: Running initial fake time-step.
# Step Time Time-step Updates g-Updates s-Updates Wall-clock time [ms]
0 0.000000e+00 0.000000e+00 6387423 0 0 21993.252
^C
Program received signal SIGINT, Interrupt.
threadpool_rchomp (tp=0x7fffffff8f30, rmap_function=Unhandled dwarf expression opcode 0xf3 ) at threadpool.c:183
183 if (tp->rmap_waiting == 0) return;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.7.x86_64 numactl-2.0.9-2.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) info threads
8 Thread 0x7fffa982a700 (LWP 53235) 0x000000346360c5ac in pthread_barrier_wait () from /lib64/libpthread.so.0
7 Thread 0x7fffaa22b700 (LWP 53234) 0x000000346360c5ac in pthread_barrier_wait () from /lib64/libpthread.so.0
6 Thread 0x7fffaac2c700 (LWP 53233) 0x000000346360c5ac in pthread_barrier_wait () from /lib64/libpthread.so.0
5 Thread 0x7fffab62d700 (LWP 53232) 0x000000346360c5ac in pthread_barrier_wait () from /lib64/libpthread.so.0
4 Thread 0x7fffac02e700 (LWP 53231) threadpool_rchomp (tp=0x7fffffff8f30, tid=Unhandled dwarf expression opcode 0xf3 ) at threadpool.c:183
3 Thread 0x7fffaca2f700 (LWP 53230) threadpool_rchomp (tp=0x7fffffff8f30, tid=Unhandled dwarf expression opcode 0xf3 ) at threadpool.c:183
2 Thread 0x7fffad430700 (LWP 53229) threadpool_rchomp (tp=0x7fffffff8f30, tid=Unhandled dwarf expression opcode 0xf3 ) at threadpool.c:183
* 1 Thread 0x7ffff686bb20 (LWP 53145) threadpool_rchomp (tp=0x7fffffff8f30, rmap_function=Unhandled dwarf expression opcode 0xf3 ) at threadpool.c:183
Added 1 commit:
- 536f0d48 - mark the rmap_data as volatile, make sure the loop in threadpool_rchomp actually…
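As background on why a spin loop like the one at threadpool.c:183 can hang: if a thread spins on a plain shared variable, an optimizing compiler is allowed to read it once and never reload it, so the thread never sees another thread's update. The sketch below is not the code from the commit above. It uses C11 atomics rather than volatile to force a fresh load on every iteration, and the struct, its fields and both functions are hypothetical stand-ins.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state; 'waiting' plays the role of a counter such as
 * rmap_waiting that other threads spin on. */
struct spin_state {
  atomic_int waiting;
  int result;
};

/* Second thread: publish a result, then signal that nothing is waiting. */
static void *finisher(void *arg) {
  struct spin_state *s = arg;
  s->result = 42;
  /* Release store: the write to 'result' becomes visible before the flag. */
  atomic_store_explicit(&s->waiting, 0, memory_order_release);
  return NULL;
}

/* Spin until the other thread clears 'waiting', then return its result. */
static int spin_until_done(void) {
  struct spin_state s;
  pthread_t t;
  atomic_init(&s.waiting, 1);
  s.result = 0;
  if (pthread_create(&t, NULL, finisher, &s) != 0) return -1;
  /* The atomic load is re-issued on every iteration; with a plain int the
   * compiler could hoist the read out of the loop and spin forever. */
  while (atomic_load_explicit(&s.waiting, memory_order_acquire) != 0)
    sched_yield();
  int r = s.result;
  pthread_join(t, NULL);
  return r;
}
```

Marking the variable volatile also forces the reload, but it gives no ordering guarantees for the surrounding data, which is why the sketch pairs a release store with an acquire load instead.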