
Parallel space_rebuild()

Merged Matthieu Schaller requested to merge parallel_get_index into master

Looking at some of the scaling results, it turns out that the last remaining significant chunk of non-parallel code is space_rebuild(), and the majority of the time in there is spent computing the cell index of each particle. This can easily be done in parallel, and on the EAGLE_25 case it shows significant improvements in code speed and scalability. I should say, though, that this is based on running on 16 cores only and on the vTune outputs (which usually match the actual tests).

What do you think?
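
For reference, the change amounts to splitting this loop over the threads. Here is a minimal sketch of the idea, using OpenMP for brevity (the branch itself uses SWIFT's own threading machinery, and the stand-in types and signatures below are assumptions, not the real headers):

#include <stddef.h>

/* Stand-ins for SWIFT's own types and helpers; the real declarations
 * live in the SWIFT headers and may differ. */
struct gpart { double x[3]; };
extern double box_wrap(double x, double a, double b);
extern int cell_getid(const int cdim[3], double i, double j, double k);

/* Each particle's cell index is independent of every other particle's,
 * so the loop parallelises trivially. */
void get_cell_indices(struct gpart *gparts, size_t nr_gparts,
                      const int cdim[3], const double dim[3],
                      const double ih[3], int *ind) {
#pragma omp parallel for
  for (size_t k = 0; k < nr_gparts; k++) {
    /* Put the particle back into the simulation volume */
    const double x = box_wrap(gparts[k].x[0], 0.0, dim[0]);
    const double y = box_wrap(gparts[k].x[1], 0.0, dim[1]);
    const double z = box_wrap(gparts[k].x[2], 0.0, dim[2]);

    /* Store its cell index */
    ind[k] = cell_getid(cdim, x * ih[0], y * ih[1], z * ih[2]);
  }
}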

Activity

/* Get the particle */
struct gpart *restrict gp = &gparts[k];

const double old_pos_x = gp->x[0];
const double old_pos_y = gp->x[1];
const double old_pos_z = gp->x[2];

/* Put it back into the simulation volume */
const double pos_x = box_wrap(old_pos_x, 0.0, dim_x);
const double pos_y = box_wrap(old_pos_y, 0.0, dim_y);
const double pos_z = box_wrap(old_pos_z, 0.0, dim_z);

/* Get its cell index */
const int index =
    cell_getid(cdim, pos_x * ih_x, pos_y * ih_y, pos_z * ih_z);
ind[k] = index;
  • What would happen if we were to atomically increment the cell.count values here? Is this slower than writing the ind array and looping over it in a second pass?
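
    For concreteness, the single-pass variant being asked about would look roughly like this, reusing the stand-in declarations from the sketch in the description (C11 atomics with relaxed ordering; all names here are hypothetical):

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical single-pass variant: compute the index and
     * immediately bump the destination cell's counter, so no second
     * pass over ind[] is needed. The price is one atomic RMW per
     * particle, which can be slower than two passes when many
     * particles land in the same cells. */
    void index_and_count(struct gpart *gparts, size_t nr_gparts,
                         const int cdim[3], const double dim[3],
                         const double ih[3], int *ind, atomic_int *counts) {
    #pragma omp parallel for
      for (size_t k = 0; k < nr_gparts; k++) {
        const double x = box_wrap(gparts[k].x[0], 0.0, dim[0]);
        const double y = box_wrap(gparts[k].x[1], 0.0, dim[1]);
        const double z = box_wrap(gparts[k].x[2], 0.0, dim[2]);
        const int index = cell_getid(cdim, x * ih[0], y * ih[1], z * ih[2]);
        ind[k] = index;
        /* Relaxed ordering is enough for a pure counter. */
        atomic_fetch_add_explicit(&counts[index], 1, memory_order_relaxed);
      }
    }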

  • Just one comment on something you've probably already tried, otherwise LGTM!

  • Reassigned to @pdraper

  • OK, thanks for clarifying!

    If the second pass becomes a bottleneck, we could always consider keeping a separate list of per-cell counts for each runner that atomically updates the global list only at the end of a chunk.
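
    A sketch of that idea with hypothetical names: each runner counts its chunk into a private array and merges into the shared counters once, so the atomics are hit at most once per cell per chunk rather than once per particle:

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Count one chunk of the (already computed) ind[] array into
     * thread-private storage, then flush to the shared counters. */
    void count_chunk(const int *ind, size_t first, size_t last,
                     int nr_cells, atomic_int *global_counts) {
      int *local = calloc(nr_cells, sizeof(int));
      if (local == NULL) return; /* error handling elided in this sketch */
      for (size_t k = first; k < last; k++) local[ind[k]]++;
      for (int c = 0; c < nr_cells; c++)
        if (local[c])
          atomic_fetch_add_explicit(&global_counts[c], local[c],
                                    memory_order_relaxed);
      free(local);
    }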

  • Works for me, so accepting.

  • Peter W. Draper changed the status to merged

  • Peter W. Draper mentioned in commit e6bc5754

  • I missed a problem, which should be resolved in f15c30bb. Can you check? It fixes master for me.

  • That works for me. Sorry for missing the parallel case.
