WIP: Allow compression in snapshots written in parallel
This adds a new parameter Snapshots:mpi_compression which can be enabled to write compressed snapshots in parallel. It only takes effect for HDF5 1.10.2 or later. It's a separate parameter because at least in HDF5 1.10.2 the feature is labelled as 'experimental', so we probably don't want to use it by default.
I also found that SWIFT was always writing snapshots in independent mode, which is incompatible with compression. This looks like a mistake: writeArray() set H5FD_MPIO_COLLECTIVE on a property list that was then never used. I've modified the code to always use collective mode for parallel writes; it was already using collective mode when reading snapshots.
Still need to investigate what effect this has on performance. It might be quite poor because all the data from one MPI rank is compressed by a single thread in the H5Dwrite call.
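For reference, a minimal sketch of the compression side of the change in the HDF5 C API (illustrative names such as h_grp, h_space and compression_level; the actual SWIFT code differs in detail). The filter is set on the dataset creation property list and requires a chunked layout, and parallel filters need HDF5 >= 1.10.2 plus collective I/O:

```c
/* Sketch: enabling gzip compression on a dataset intended for parallel
 * writes (illustrative names, not the actual SWIFT code). */
hid_t h_prop = H5Pcreate(H5P_DATASET_CREATE);
const hsize_t chunk_dims[2] = {1 << 16, 3};  /* example chunk shape */
H5Pset_chunk(h_prop, 2, chunk_dims);         /* filters require a chunked layout */
H5Pset_deflate(h_prop, compression_level);   /* level read from the new parameter */

hid_t h_data = H5Dcreate(h_grp, "Coordinates", H5T_NATIVE_DOUBLE, h_space,
                         H5P_DEFAULT, h_prop, H5P_DEFAULT);
```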
Took advantage of the empty system over the weekend to make some tests.
These are all with these modules:
intel_comp/2018 intel_mpi/2018 parallel_hdf5/1.10.3 fftw/3.3.7 gsl/2.4 parmetis/4.0.3
and SWIFT compiled with --enable-ipo --with-tbbmalloc
and nothing else. I am running an EAGLE-50 at z=0.1 replicated 4 times (i.e. an EAGLE-200) on 64 nodes using 128 MPI ranks on cosma7. The directory where I write uses 64 stripes. The only difference from eagle_50.yml is the use of 64 top-level cells instead of 16. I switched on IO_SPEED_MEASUREMENT in parallel_io.c to get timings.
Here is what I get for master: timings_master.txt
and I got this for a version where I add the
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);
call to the H5Dwrite(): timings_parallel.txt (ignore the raw clock ticks in the [] at the start of each line).
Running the whole thing multiple times shows the same picture, with the numbers varying a little and either scheme coming out ahead on a given run.
My conclusion: using
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);
or not makes no difference. However, it is closer to HDF5's own examples, so we may as well add it to reduce confusion.
The other thing to notice is that we only reach high write speeds when dumping a relatively large array. As expected, I guess.
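For clarity, this is roughly what that change amounts to (a sketch; h_data, h_type, the dataspaces and the temp buffer stand in for the existing arguments of the write call):

```c
/* Sketch: a dataset transfer property list requesting collective MPI-IO,
 * passed to H5Dwrite instead of the default (independent) one. */
hid_t h_plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);

H5Dwrite(h_data, h_type, h_memspace, h_filespace, h_plist_id, temp);

H5Pclose(h_plist_id);
```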
Regarding the use of parallel compression: doing exactly what this branch does and setting the compression level to 4, with the exact same setup as above, I crash with the following:
[0000] [01295.2] engine_dump_snapshot: Dumping snapshot at a=9.090909e-01
[0000] [01297.1] write_output_parallel: Preparing file on rank 0 took 57.641 ms.
[0000] [01308.2] write_output_parallel: Setting parallel HDF5 access properties took 11058.375 ms.
[0000] [01308.2] write_output_parallel: Opening HDF5 file took 16.344 ms.
[0000] [01308.2] write_output_parallel: Snapshot and internal units match. No conversion needed.
[0000] [01309.2] writeArray_chunk: Copying for 'Coordinates' took 314.380 ms.
[0000] [01380.3] writeArray_chunk: H5Dwrite for 'Coordinates' (262016 MB) took 71067.578 ms (speed = 3686.857030 MB/s).
[0000] [01380.4] writeArray: Need to redo one iteration for array 'Coordinates'
[0000] [01380.8] writeArray_chunk: Copying for 'Coordinates' took 310.571 ms.
[0000] [01471.3] writeArray_chunk: H5Dwrite for 'Coordinates' (262016 MB) took 90553.719 ms (speed = 2893.486912 MB/s).
[0000] [01471.4] writeArray: Need to redo one iteration for array 'Coordinates'
[0000] [01471.6] writeArray_chunk: Copying for 'Coordinates' took 136.182 ms.
[0000] [01525.3] writeArray_chunk: H5Dwrite for 'Coordinates' (68224 MB) took 53702.363 ms (speed = 1270.409640 MB/s).
[0000] [01525.4] writeArray: 'Coordinates' took 217137.413 ms.
Error in ADIOI_Calc_aggregator(): rank_index(85146) >= fd->hints->cb_nodes (64) fd_size=4683776 off=398827827200
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Error in ADIOI_Calc_aggregator(): rank_index(55449) >= fd->hints->cb_nodes (64) fd_size=4683776 off=259732557824
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Error in ADIOI_Calc_aggregator(): rank_index(129) >= fd->hints->cb_nodes (64) fd_size=4683776 off=626327552
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
It does seem that we crash when writing the remainder array. Recall that we write data in chunks (not HDF5's chunks) of 2GB per MPI rank, as this is the limit set by the underlying MPI-IO library. When a rank has more than that, we write in multiple passes and may end up with a remainder pass in which not all ranks participate. It could be that this is not yet well supported by HDF5 in version 1.10.3. This was an issue as well (without compression) in early 1.8.x versions of the library.
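To make that concrete, here is a rough sketch of the per-rank chunking loop (hypothetical names; the real writeArray()/writeArray_chunk() split differs in detail):

```c
#include <mpi.h>

/* Hypothetical helper standing in for the actual chunked write. */
void write_chunk_sketch(long long offset, long long count);

/* Rough sketch of the ~2GB-per-rank chunking described above. All ranks
 * must take part in the same number of collective writes, which is where
 * the "remainder" iterations with no local data come from. */
void write_in_chunks_sketch(long long local_bytes) {
  const long long max_bytes = 2000LL * 1024LL * 1024LL; /* MPI-IO per-call limit */

  /* Number of passes this rank needs, and the maximum over all ranks. */
  long long n_iter = (local_bytes + max_bytes - 1) / max_bytes;
  long long n_iter_max = 0;
  MPI_Allreduce(&n_iter, &n_iter_max, 1, MPI_LONG_LONG, MPI_MAX, MPI_COMM_WORLD);

  for (long long i = 0; i < n_iter_max; ++i) {
    const long long offset = i * max_bytes;
    long long count = local_bytes - offset;
    if (count < 0) count = 0;       /* this rank has nothing left... */
    if (count > max_bytes) count = max_bytes;

    /* ...but it still joins the collective H5Dwrite, with an empty
     * selection when count == 0. */
    write_chunk_sketch(offset, count);
  }
}
```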
The other thing to note is that the bits of the array that were successfully written took >10x longer to do so.
I think the advantage of collective mode is supposed to be that it allows MPI-IO to decide how many nodes should do the writing. That might help if we have a combination of file system and job size where having all of them write at once is not the best strategy.
Collective writes only work if all ranks in the communicator call H5Dwrite. If the code doesn't do so already, we could make ranks with no data call H5Dwrite with an empty selection. But I find it a bit surprising that it would crash instead of hang if this were the problem.
That speed measurement is disappointing, but maybe not a big surprise.
All ranks always call H5Dwrite, and they indeed use an empty selection if there is nothing to write. This works perfectly well when no compression is used. They may not have tested that case with compression on, however. Or maybe this point is unrelated to the crash itself.
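For reference, a sketch of that empty-selection pattern (illustrative variable names, not the exact SWIFT code): a rank with nothing to write still takes part in the collective call, just with nothing selected.

```c
/* Sketch: every rank calls H5Dwrite collectively; ranks with no data
 * select nothing in both the memory and file dataspaces. */
hid_t h_memspace = H5Screate_simple(1, &num_elements, NULL);
hid_t h_filespace = H5Dget_space(h_data);

if (num_elements > 0) {
  H5Sselect_hyperslab(h_filespace, H5S_SELECT_SET, &file_offset, NULL,
                      &num_elements, NULL);
} else {
  H5Sselect_none(h_memspace);   /* empty participation */
  H5Sselect_none(h_filespace);
}

H5Dwrite(h_data, h_type, h_memspace, h_filespace, h_plist_id, buf);
```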
And 1.10.5 may fix this since the feature is not quite stable yet.
I think the speed measurement shows that as long as the file was opened with the right properties it does collective writes. The fully serial code, where one rank writes after the other, is more than 30x slower on that example.
Yes, sorry, I meant that the factor of 10 slower with compression on is a bit disappointing. And thanks for taking a look at this. Would have been useful for eagle-xl if we could rely on it.
One extra thought: we are currently in an MPI + X context, and the compression is only done by one thread of each MPI rank. Since the compression itself seems to be the bottleneck, I wonder whether there is a way for HDF5 to use thread-parallel compression. I'll bring this up with the HDF5 people I spoke to before about bugs and see what they think.