
Change the level at which the sort tasks are set

Closed: Matthieu Schaller requested to merge sort_level into master

This introduces the two changes I mentioned this morning:

  • Move the stars resort to a deeper level.
  • Move the hydro and stars sort tasks to two levels below the super-level.

This increases parallelism and hence reduces dead time.
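
As a rough sketch of the idea (the names below are illustrative, not SWIFT's actual API): instead of creating one sort task per super-cell, the construction recurses two levels further down the cell tree before creating the tasks, yielding more, smaller tasks.

    /* Illustrative sketch only: attach the sort tasks `depth` levels below
     * the super-cell rather than at the super-cell itself. */
    static void attach_sort_tasks(struct scheduler *s, struct cell *c, int depth) {
      if (depth == 0 || !c->split) {
        /* Target level (or a leaf) reached: create the sort task here. */
        c->sorts = scheduler_addtask(s, task_type_sort, task_subtype_none,
                                     /* flags= */ 0, /* implicit= */ 0, c, NULL);
        return;
      }
      /* Otherwise recurse one level deeper into the cell tree. */
      for (int k = 0; k < 8; k++)
        if (c->progeny[k] != NULL) attach_sort_tasks(s, c->progeny[k], depth - 1);
    }

Called with depth = 2 from each super-cell, this creates up to 64 sort tasks where there used to be one, which is where the extra parallelism comes from.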

Activity

  • added 1 commit

    • fbb2faf4 - Also clear the sub-sort flag in the lower-level cells when aborting the recursion in the sorts

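
    A minimal sketch of what that commit addresses, with illustrative flag names: when the sort recursion is aborted at a cell, any "sub-sort requested" flags already set in the cells below it have to be cleared too, otherwise they would linger and trigger spurious sorts later.

    /* Illustrative sketch: clear the sub-sort flag in the whole subtree
     * below `c` when the recursion is aborted at `c`. */
    static void clear_sub_sort_flags(struct cell *c) {
      c->do_sub_sort = 0;
      if (c->split)
        for (int k = 0; k < 8; k++)
          if (c->progeny[k] != NULL) clear_sub_sort_flags(c->progeny[k]);
    }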

  • All my tests have run. @pdraper would you be able to also check that this did not break anything? Thanks!

  • Yes, sorry, I meant to look at this earlier.

  • Didn't get far. When running the standard SodShock I see:

    [00022.6] runner_sort.c:runner_do_hydro_sort():234: Sorting un-drifted cell c->nodeID=0

    That's after 266 steps. Happily, it seems repeatable.
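
    For context, that message comes from a debugging check of roughly this shape (paraphrased; the exact fields in runner_sort.c may differ):

    /* Paraphrased check: a cell may only be sorted once its particles
     * have been drifted to the current time. */
    if (c->ti_old_part != e->ti_current)
      error("Sorting un-drifted cell c->nodeID=%d", c->nodeID);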

  • I configured with ./configure --enable-debug --enable-debugging-checks on my laptop and just ran ./run.sh in examples/HydroTests/SodShock_3D without any issues.

    Note that step 266 is the final step in my case.

    Are you trying something different?

  • No, nothing too different. My configure line is:

    ./configure --with-parmetis --enable-sanitizer --enable-undefined-sanitizer --enable-debug

    Naturally using gnu_comp/7.3.0 to get the sanitizers enabled.

  • It has also disappeared for me now. So either I made an error, or the repeatability was a fluke. I will keep looking, but you may as well not, for now.

  • Silly me, doing too many things at once. I really meant the SedovBlast, not the SodShock. That one is repeatable.

  • Ah good. I can see this fail as well.

    Now... what is it that happens here but never in the EAGLE-25...

  • added 1 commit

    • c413c53e - Activate the drifts at the super-level alongside the sorts


  • This last commit fixes that test case. I think I had forgotten the case of rather shallow trees.
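
    A minimal sketch of the fix, with illustrative names: wherever a sort task is activated, the matching drift has to be activated as well, so that a shallow tree, whose sort task sits at (or very near) the super-level, never sorts un-drifted particles.

    /* Illustrative sketch: activating the sorts of a super-cell now also
     * activates the drift of its particles. */
    if (c->sorts != NULL) {
      scheduler_activate(s, c->sorts);
      cell_activate_drift_part(c, s); /* the piece missing for shallow trees */
    }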

  • Thanks, that has fixed the issue. I will try some more tests.

  • Seeing a problem I've not come across for quite some time:

    [0010] [00061.9] engine.c:engine_addlink():160: Link table overflow. Increase the value of `Scheduler:links_per_tasks`.
    

    This is for EAGLE_50/256 running on 32 ranks of COSMA5 with fullish physics (--with-subgrid=EAGLE):

    mpirun -np $SLURM_NTASKS ../../swift_mpi --pin --cooling --star-formation --feedback --stars --cosmology --hydro --self-gravity -v 1 --threads=16 eagle_50.yml

    Trying again with this value raised to 50 from 25.

  • I had not needed to increase this for the EAGLE-25 on 4 (8?) ranks, but I am not surprised that this can happen. I could update the heuristic decision-making to be robust out of the box.

  • This value is a fixed one (unlike tasks per cell), so we'll need a new canonical value.
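
    For reference, the overflowing table is sized up front from this fixed parameter, along these lines (a sketch; the exact expression lives in engine.c and may differ):

    /* Sketch: the link table is allocated once, from the task count known
     * at that point and the fixed Scheduler:links_per_tasks parameter. */
    e->size_links = e->links_per_tasks * e->sched.nr_tasks;
    if ((e->links = (struct link *)malloc(sizeof(struct link) * e->size_links)) == NULL)
      error("Failed to allocate the links array.");

    So the canonical value has to cover the worst-case ratio between the final number of links and the task count at allocation time, which is what makes picking it awkward.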

  • So I had to increase this value to 200, from 25, which is quite a leap. That shows:

    Nr. of links: 5082759 allocated links: 6889600 ratio: 0.737744 memory use: 105 MB.

    So not that far above the actual requirement, and not a lot of memory either, but I am confused by all this: the actual usage report suggests that a value of 2 should do:

    Actual usage: tasks/cell: 1.363218 links/task: 1.957144

    Which is about right, as the number of tasks is about half the number of links. Going to look more closely at that code.

  • Could there be a large node-to-node variation? The message is usually only printed on rank 0.

  • Cannot tell, as I only ran with verbosity 1, but this is EAGLE_50 at high redshift, so the load should be uniformly distributed.

  • I see: we are using two different task counts for these two uses. At the time we have the first count we still have a lot of work to do creating more tasks, but we already have to start creating the links. In any case, the usage report is incorrect, as it should use that first task count, not the final one. If we correct that we see:

    Nr. of links: 6880308 allocated links: 13294000 ratio: 0.517550 memory use: 202 MB.
    Actual usage: tasks/cell: 1.517805 links/initial task: 103.509972

    Not obviously sensible, but no longer misleading.
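
    A sketch of the corrected report, assuming the task count at allocation time was recorded in a variable such as nr_tasks_at_alloc (hypothetical name):

    /* Sketch: report the link usage against the task count that was used
     * for the allocation, not the final one, so the ratio is honest. */
    message("Nr. of links: %zu allocated links: %zu ratio: %f memory use: %zu MB.",
            e->nr_links, e->size_links, (double)e->nr_links / e->size_links,
            e->size_links * sizeof(struct link) / (1024 * 1024));
    message("Actual usage: tasks/cell: %f links/initial task: %f",
            (double)e->sched.nr_tasks / e->s->tot_cells,
            (double)e->nr_links / nr_tasks_at_alloc);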

  • Anyway, the test itself is now running, so I'll leave it going for a while longer.
