Gpart mpi io

Merged Matthieu Schaller requested to merge gpart_mpi_io into master

We can now do i/o with multiple particle types using the serial version. I may implement a parallel-HDF5 version later, but as we (I) still need to work on i/o, I am not sure that is the highest priority.

The code crashes later in the exchange of strays, but that's what @nnrw56 is working on.
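For reference, a minimal sketch of the rank-ordered ("serial") output pattern meant here (illustrative only, not the actual serial_io.c; names are assumptions): rank 0 creates the file, then every rank in turn re-opens it and writes its own particles while the others wait at a barrier.

    #include <hdf5.h>
    #include <mpi.h>

    /* Illustrative sketch of rank-ordered ("serial") writes to one HDF5 file. */
    void serial_write_sketch(MPI_Comm comm, const char *fname) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      /* Rank 0 creates the file (and would create the /PartTypeN groups). */
      if (rank == 0) {
        hid_t f = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        H5Fclose(f);
      }

      /* Each rank then writes its slice of the datasets, one rank at a time. */
      for (int r = 0; r < size; r++) {
        MPI_Barrier(comm);
        if (r != rank) continue;
        hid_t f = H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);
        /* ... extend the per-type datasets and write this rank's particles ... */
        H5Fclose(f);
      }
    }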

Activity

  • The old SodShock will no longer run:

     > mpirun -np 2 ../swift_mpi -t 4 -f sodShock.hdf5 -m 0.01 -w 5000 -c 1. -d 1e-7 -e 0.01
    .
    .
    .
    [0000] [00000.2] engine_init: Minimal timestep size (on time-line): 5.960464e-08
    [0000] [00000.2] engine_init: Maximal timestep size (on time-line): 7.812500e-03
    [0000] [00000.4] engine_split: Re-allocating parts array from 512064 to 614476.
    HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 47863001370560:
      #000: ../../../src/H5G.c line 463 in H5Gopen2(): unable to open group
        major: Symbol table
        minor: Can't open object
      #001: ../../../src/H5Gint.c line 320 in H5G__open_name(): group not found
        major: Symbol table
        minor: Object not found
      #002: ../../../src/H5Gloc.c line 430 in H5G_loc_find(): can't find object
        major: Symbol table
        minor: Object not found
      #003: ../../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
        major: Symbol table
        minor: Object not found
      #004: ../../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
        major: Symbol table
        minor: Callback failed
      #005: ../../../src/H5Gloc.c line 385 in H5G_loc_find_cb(): object 'PartType1' doesn't exist
        major: Symbol table
        minor: Object not found
    [0000] [00001.1] serial_io.c:write_output_serial():753: Error while opening particle group /PartType1.
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
  • Matthieu Schaller Added 3 commits:

    • 93b372e5 - Uninitialised variables triggering warnings in DDT
    • 3d5a42df - Better check for whether or not to create the HDF5 groups for a given particle type
    • 4b81f715 - Merge branch 'gpart_mpi_io' of gitlab.cosma.dur.ac.uk:swift/swiftsim into gpart_mpi_io
  • Added 1 commit:

    • 6bf681f3 - Better check for whether or not to create the HDF5 groups for a given particle type
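    A minimal sketch of what such a check can look like (illustrative; NUM_PARTICLE_TYPES, N_total, create_groups and the function name are assumptions, not the commit's actual code): only create or open a /PartTypeN group when that particle type has something to write, which avoids the failing H5Gopen2() on the empty /PartType1 above.

      #include <hdf5.h>
      #include <stdio.h>

      #define NUM_PARTICLE_TYPES 6 /* assumption: Gadget-style type count */

      /* Write (or skip) the /PartTypeN groups; 'create_groups' is non-zero on
         the rank that creates the file. */
      void write_particle_groups(hid_t h_file,
                                 const long long N_total[NUM_PARTICLE_TYPES],
                                 int create_groups) {
        for (int ptype = 0; ptype < NUM_PARTICLE_TYPES; ptype++) {
          if (N_total[ptype] == 0) continue; /* e.g. no DM particles in SodShock */

          char groupName[32];
          snprintf(groupName, sizeof(groupName), "/PartType%d", ptype);

          /* The first writer creates the group, later writers just open it. */
          hid_t h_grp = create_groups
                            ? H5Gcreate2(h_file, groupName, H5P_DEFAULT,
                                         H5P_DEFAULT, H5P_DEFAULT)
                            : H5Gopen2(h_file, groupName, H5P_DEFAULT);
          if (h_grp < 0) {
            fprintf(stderr, "Error while opening particle group %s.\n", groupName);
            return;
          }

          /* ... write the datasets for this particle type ... */
          H5Gclose(h_grp);
        }
      }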
  • Yes, that is expected. Since the gparts are not (re-)distributed correctly over MPI, we end up with an incorrect number of particles on each rank and inconsistencies between the parts and gparts. If you don't redistribute, everything works.

    But I should have been clearer that this only works in conjunction with @nnrw56's work on the exchanges.
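
    (For illustration, a hypothetical sanity check along these lines would flag the broken links right after the exchange; the struct stand-ins and all names here are assumptions based on the part->gpart link discussed later in this thread.)

      #include <stddef.h>
      #include <stdio.h>

      /* Minimal stand-ins for the real structs (illustrative only). */
      struct gpart { double x[3]; long long id_or_neg_offset; };
      struct part { double x[3]; struct gpart *gpart; };

      /* Hypothetical check: every part's gpart back-pointer must lie inside
         this rank's gparts array, otherwise the exchange broke the links. */
      void check_links(struct part *parts, size_t nr_parts,
                       struct gpart *gparts, size_t nr_gparts) {
        for (size_t k = 0; k < nr_parts; k++) {
          if (parts[k].gpart != NULL &&
              (parts[k].gpart < gparts || parts[k].gpart >= gparts + nr_gparts))
            fprintf(stderr, "part %zu linked to a gpart outside the local array\n", k);
        }
      }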

  • I see, so I won't accept this yet and will move to WIP.

    BTW, I merged engine_exchange_strays into this branch and that does not work either.

  • Peter W. Draper Title changed from Gpart mpi io to WIP: Gpart mpi io

  • Peter, how does it fail?

  • Ran the test above and got:

    .
    .
    .
    [0000] [00000.4] engine_split: Re-allocating parts array from 512064 to 614476.
    [0000] [00001.9] main: Running on 1024128 gas particles and 0 DM particles until t=1.000e+00 with 4 threads and 4 queues (dt_min=1.000e-07, dt_max=1.000e-02)...
    [0000] [00001.9] engine_init_particles: Initialising particles
    [0000] [00001.9] engine.c:engine_exchange_strays():659: Do not have a proxy for the requested nodeID 0 for part with id=47748781199296, x=[2.980766e-01,4.629061e-02,2.251668e-02].
    application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
  • OK, thanks, will see if I can find out what caused this tonight.

  • It fails because engine_redistribute() does not exchange gparts.

    We fail before actually reaching the tasks part of the code. So the updates to exchange_strays() won't solve the problem.

    I hacked something to make engine_redistribute() work with gparts; with that, the i/o works and we die later on in the code. The problem with my implementation of engine_redistribute() is that I don't deal with the part-gpart linking. We need to agree on how to do this first.

    Edited by Matthieu Schaller
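
    For illustration, a minimal sketch of the missing piece (not the hack described above; the struct stand-in and rank_of() are assumptions): bin the gparts by destination rank and exchange the counts, mirroring what is already done for the parts. The part-gpart re-linking after the exchange, which is the open question here, is deliberately left out.

      #include <mpi.h>
      #include <stdlib.h>

      /* Minimal stand-in for the real struct (illustrative only). */
      struct gpart { double x[3]; long long id_or_neg_offset; };

      /* Hypothetical helper: trivial slab decomposition along x, positions
         assumed to lie in [0,1). The real code would use the cell grid. */
      static int rank_of(const double x[3], int nr_nodes) {
        return ((int)(x[0] * nr_nodes)) % nr_nodes;
      }

      void redistribute_gparts_sketch(struct gpart *gparts, size_t nr_gparts,
                                      int nr_nodes, MPI_Comm comm) {
        int *dest = malloc(nr_gparts * sizeof(int));
        int *send_counts = calloc(nr_nodes, sizeof(int));
        int *recv_counts = calloc(nr_nodes, sizeof(int));

        /* Decide where each gpart has to go, just as is done for the parts. */
        for (size_t k = 0; k < nr_gparts; k++) {
          dest[k] = rank_of(gparts[k].x, nr_nodes);
          send_counts[dest[k]]++;
        }

        /* Every rank learns how many gparts to expect from every other rank. */
        MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);

        /* ... pack per-destination buffers, exchange them (MPI_Alltoallv or
           point-to-point sends), then rebuild the part<->gpart links, which
           is the step deliberately left out here. */

        free(dest);
        free(send_counts);
        free(recv_counts);
      }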
  • Added 1 commit:

    • 669d0999 - Temporary fix to preserve master
  • Matthieu Schaller Title changed from WIP: Gpart mpi io to Gpart mpi io

  • OK, this is now ready to go. There are a few lines in main() that de-allocate the array of gparts that has been freshly read in. This allows the code to operate normally after that.

    Just comment them out to test the gparts in MPI mode further down the code.

  • Added 1 commit:

    • ae787120 - Added one command-line option '-g' to switch on gravity (i.e. not de-allocate th…
  • Actually, I have modified main() by adding a command-line option.

    • If you run normally, the gparts get de-allocated and we have the old behaviour with everything working smoothly.
    • If you add -g, you switch on gravity, which will preserve the gparts and add the gravity policy to the mask.

    This latter option should allow @tt and @jregan to keep working on their branch after they have merged master into theirs.
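
    For illustration, a sketch of how such an option can be wired up (the flag letters mirror the run line quoted earlier; the policy names and structure here are assumptions, not the actual main()):

      #include <stdio.h>
      #include <unistd.h>

      /* Hypothetical policy bits (the real SWIFT policy names differ). */
      #define POLICY_NONE 0
      #define POLICY_GRAVITY (1 << 0)

      int main(int argc, char *argv[]) {
        int with_gravity = 0;
        int c;

        /* The flag letters follow the run line shown earlier in this thread. */
        while ((c = getopt(argc, argv, "gt:f:m:w:c:d:e:")) != -1) {
          switch (c) {
            case 'g':
              with_gravity = 1; /* keep the gparts and switch on gravity */
              break;
            default:
              /* ... the other options are handled as before ... */
              break;
          }
        }

        int engine_policies = POLICY_NONE;
        if (with_gravity) engine_policies |= POLICY_GRAVITY;

        printf("with_gravity=%d policies=%d\n", with_gravity, engine_policies);
        return 0;
      }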

  • 383 389       N_total[0] = Ngas;
    384 390       N_total[1] = Ngpart - Ngas;
    385 391       message("Read %lld gas particles and %lld DM particles from the ICs",
    386     -             N_total[0], N_total[1]);
        392 +             N_total[0], N_total[1]);
    387 393   #endif
    388 394
        395 +   /* MATTHIEU: Temporary fix to preserve master */
        396 +   if (!with_gravity) {
        397 +     free(gparts);
    • Yes! When we loop through the parts to re-link, we do part[k].gpart->id_or_neg_offset = ... irrespective of the number of gparts.
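
      For illustration, a guarded version of that loop might look like this (sketch; nr_parts/nr_gparts and the assigned value are placeholders, since the actual expression is elided above):

        /* Guarded version (sketch): only touch the gpart back-links when
           gparts were actually kept for this run. */
        if (nr_gparts > 0) {
          for (size_t k = 0; k < nr_parts; k++) {
            if (parts[k].gpart == NULL) continue; /* no gravity counterpart */
            parts[k].gpart->id_or_neg_offset =
                -(long long)k; /* placeholder for the elided value above */
          }
        }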

  • Added 1 commit:

  • 383 390       N_total[0] = Ngas;
    384 391       N_total[1] = Ngpart - Ngas;
    385 392       message("Read %lld gas particles and %lld DM particles from the ICs",
    386     -             N_total[0], N_total[1]);
        393 +             N_total[0], N_total[1]);
    387 394   #endif
    388 395
        396 +   /* MATTHIEU: Temporary fix to preserve master */
        397 +   if (!with_gravity) {
  • @nnrw56 Are you looking into adapting engine_redistribute()? Or should I give it a go?

  • Branch can go.

  • I thought of doing it myself since I already know the code and would probably not spend more than two evenings on it.

    Having said that, if you want to do it to get to know that part of the code better, then it's all yours :)

  • Seems to run the old tests, so accepting.

  • Peter W. Draper Status changed to merged

  • Peter W. Draper mentioned in commit 34e76452

  • mentioned in issue #127 (closed)

  • mentioned in issue #130 (closed)
