WIP: Allow compression in snapshots written in parallel
This adds a new parameter Snapshots:mpi_compression which can be enabled to write compressed snapshots in parallel. It only takes effect for HDF5 1.10.2 or later. It's a separate parameter because at least in HDF5 1.10.2 the feature is labelled as 'experimental', so we probably don't want to use it by default.
I also found that SWIFT was always writing snapshots in independent mode, which is incompatible with compression. This looks like a mistake: writeArray() set H5FD_MPIO_COLLECTIVE on a property list that was then never used. I've modified the code to always use collective mode for parallel writes; it was already using collective mode when reading snapshots.
Still need to investigate what effect this has on performance. It might be quite poor because all the data from one MPI rank is compressed by a single thread in the H5Dwrite call.
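For reference, a minimal sketch of the compression side of the change in the HDF5 C API (illustrative names such as h_grp, h_space and compression_level; the actual SWIFT code differs in detail). The filter is set on the dataset creation property list and requires a chunked layout, and parallel filters need HDF5 >= 1.10.2 plus collective I/O:

```c
/* Sketch: enabling gzip compression on a dataset intended for parallel
 * writes (illustrative names, not the actual SWIFT code). */
hid_t h_prop = H5Pcreate(H5P_DATASET_CREATE);
const hsize_t chunk_dims[2] = {1 << 16, 3};  /* example chunk shape */
H5Pset_chunk(h_prop, 2, chunk_dims);         /* filters require a chunked layout */
H5Pset_deflate(h_prop, compression_level);   /* level read from the new parameter */

hid_t h_data = H5Dcreate(h_grp, "Coordinates", H5T_NATIVE_DOUBLE, h_space,
                         H5P_DEFAULT, h_prop, H5P_DEFAULT);
```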
Took advantage of the empty system over the weekend to make some tests.
These are all with these modules:
intel_comp/2018 intel_mpi/2018 parallel_hdf5/1.10.3 fftw/3.3.7 gsl/2.4 parmetis/4.0.3
and SWIFT compiled with --enable-ipo --with-tbbmalloc
and nothing else. I am running an EAGLE-50 at z=0.1 replicated 4 times (i.e. an EAGLE-200) on 64 nodes using 128 MPI ranks on cosma7. The directory where I write uses 64 stripes. The only difference from eagle_50.yml is the use of 64 top-level cells instead of 16. I switched on IO_SPEED_MEASUREMENT in parallel_io.c to get timings.
Here is what I get for master: timings_master.txt
and I got this for a version where I add the
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);
call to the H5Dwrite(): timings_parallel.txt (ignore the raw clock ticks in the [] at the start of each line).
Running the whole thing multiple times shows the same picture, with the numbers varying a little and either scheme coming out ahead on a given run.
My conclusion: using
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);
or not makes no difference. However, it is closer to HDF5's own examples, so we may as well add it to reduce confusion.
The other thing to notice is that we only reach high write speeds when dumping a relatively large array. As expected, I guess.
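For clarity, this is roughly what that change amounts to (a sketch; h_data, h_type, the dataspaces and the temp buffer stand in for the existing arguments of the write call):

```c
/* Sketch: a dataset transfer property list requesting collective MPI-IO,
 * passed to H5Dwrite instead of the default (independent) one. */
hid_t h_plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(h_plist_id, H5FD_MPIO_COLLECTIVE);

H5Dwrite(h_data, h_type, h_memspace, h_filespace, h_plist_id, temp);

H5Pclose(h_plist_id);
```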
Regarding the use of parallel compression: doing exactly what this branch does and setting the compression level to 4, with the exact same setup as above, I crash with the following:
[0000] [01295.2] engine_dump_snapshot: Dumping snapshot at a=9.090909e-01
[0000] [01297.1] write_output_parallel: Preparing file on rank 0 took 57.641 ms.
[0000] [01308.2] write_output_parallel: Setting parallel HDF5 access properties took 11058.375 ms.
[0000] [01308.2] write_output_parallel: Opening HDF5 file took 16.344 ms.
[0000] [01308.2] write_output_parallel: Snapshot and internal units match. No conversion needed.
[0000] [01309.2] writeArray_chunk: Copying for 'Coordinates' took 314.380 ms.
[0000] [01380.3] writeArray_chunk: H5Dwrite for 'Coordinates' (262016 MB) took 71067.578 ms (speed = 3686.857030 MB/s).
[0000] [01380.4] writeArray: Need to redo one iteration for array 'Coordinates'
[0000] [01380.8] writeArray_chunk: Copying for 'Coordinates' took 310.571 ms.
[0000] [01471.3] writeArray_chunk: H5Dwrite for 'Coordinates' (262016 MB) took 90553.719 ms (speed = 2893.486912 MB/s).
[0000] [01471.4] writeArray: Need to redo one iteration for array 'Coordinates'
[0000] [01471.6] writeArray_chunk: Copying for 'Coordinates' took 136.182 ms.
[0000] [01525.3] writeArray_chunk: H5Dwrite for 'Coordinates' (68224 MB) took 53702.363 ms (speed = 1270.409640 MB/s).
[0000] [01525.4] writeArray: 'Coordinates' took 217137.413 ms.
Error in ADIOI_Calc_aggregator(): rank_index(85146) >= fd->hints->cb_nodes (64) fd_size=4683776 off=398827827200
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Error in ADIOI_Calc_aggregator(): rank_index(55449) >= fd->hints->cb_nodes (64) fd_size=4683776 off=259732557824
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Error in ADIOI_Calc_aggregator(): rank_index(129) >= fd->hints->cb_nodes (64) fd_size=4683776 off=626327552
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
It does seem that we crash when writing the remainder array. Recall that we write data in chunks (not HDF5's chunks) of 2GB per MPI rank, as this is the limit set by the underlying MPI-IO library. When a rank has more than that, we write in multiple passes and may end up with a remainder pass in which not all ranks participate. It could be that this is not yet well supported by HDF5 in version 1.10.3. This was an issue as well (without compression) in early 1.8.x versions of the library.
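To make that concrete, here is a rough sketch of the per-rank chunking loop (hypothetical names; the real writeArray()/writeArray_chunk() split differs in detail):

```c
#include <mpi.h>

/* Hypothetical helper standing in for the actual chunked write. */
void write_chunk_sketch(long long offset, long long count);

/* Rough sketch of the ~2GB-per-rank chunking described above. All ranks
 * must take part in the same number of collective writes, which is where
 * the "remainder" iterations with no local data come from. */
void write_in_chunks_sketch(long long local_bytes) {
  const long long max_bytes = 2000LL * 1024LL * 1024LL; /* MPI-IO per-call limit */

  /* Number of passes this rank needs, and the maximum over all ranks. */
  long long n_iter = (local_bytes + max_bytes - 1) / max_bytes;
  long long n_iter_max = 0;
  MPI_Allreduce(&n_iter, &n_iter_max, 1, MPI_LONG_LONG, MPI_MAX, MPI_COMM_WORLD);

  for (long long i = 0; i < n_iter_max; ++i) {
    const long long offset = i * max_bytes;
    long long count = local_bytes - offset;
    if (count < 0) count = 0;       /* this rank has nothing left... */
    if (count > max_bytes) count = max_bytes;

    /* ...but it still joins the collective H5Dwrite, with an empty
     * selection when count == 0. */
    write_chunk_sketch(offset, count);
  }
}
```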
The other thing to note is that the bits of the array that were successfully written took >10x longer to do so.
I think the advantage of collective mode is supposed to be that it allows MPI-IO to decide how many nodes should do the writing. That might help if we have a combination of file system and job size where having all of them write at once is not the best strategy.
Collective writes only work if all ranks in the communicator call H5Dwrite. If the code doesn't do so already, we could make ranks with no data call H5Dwrite with an empty selection. But I find it a bit surprising that it would crash instead of hang if this were the problem.
That speed measurement is disappointing, but maybe not a big surprise.
All ranks always call H5Dwrite, and they indeed use an empty selection if there is nothing to write. This works perfectly well when no compression is used. They may not have tested that case with compression on, however. Or maybe this point is unrelated to the crash itself.
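For reference, a sketch of that empty-selection pattern (illustrative variable names, not the exact SWIFT code): a rank with nothing to write still takes part in the collective call, just with nothing selected.

```c
/* Sketch: every rank calls H5Dwrite collectively; ranks with no data
 * select nothing in both the memory and file dataspaces. */
hid_t h_memspace = H5Screate_simple(1, &num_elements, NULL);
hid_t h_filespace = H5Dget_space(h_data);

if (num_elements > 0) {
  H5Sselect_hyperslab(h_filespace, H5S_SELECT_SET, &file_offset, NULL,
                      &num_elements, NULL);
} else {
  H5Sselect_none(h_memspace);   /* empty participation */
  H5Sselect_none(h_filespace);
}

H5Dwrite(h_data, h_type, h_memspace, h_filespace, h_plist_id, buf);
```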
And 1.10.5 may fix this since the feature is not quite stable yet.
I think the speed measurement shows that as long as the file was opened with the right properties it does collective writes. The fully serial code, where one rank writes after the other, is more than 30x slower on that example.
Yes, sorry, I meant that the factor of 10 slower with compression on is a bit disappointing. And thanks for taking a look at this. Would have been useful for eagle-xl if we could rely on it.
One extra thought: we are currently in an MPI + X context, and the compression is only done by one thread of each MPI rank. Since the compression itself seems to be the bottleneck, I wonder whether there is a way for HDF5 to use thread-parallel compression. I'll bring this up with the HDF5 people I spoke to before about bugs and see what they think.