New output format: continuous snapshots
I'm creating this issue to get the discussion started on a new file format to replace the ubiquitous snapshots.
Why? Snapshots are a pain for several reasons:
- First and foremost, they provide only very limited information on the dynamics of a simulation: the fastest-moving particles are under-resolved in time, while the slowest-moving particles are over-resolved.
- Furthermore, they're a pain to dump since all nodes write to the global file system at the same time.
- Finally, they're just too much data.
What? So what should we do differently? The main ideas are:
- Do not create periodic snapshots, but instead continuously log the movement of particles during the simulation.
- The particles should be logged in separate files per MPI rank.
- The particle data, e.g. position, velocity, energy, smoothing length, and any other quantity of interest, are only logged once they have changed significantly.
- The data for any particle can be recovered at arbitrary time points by interpolating between the two (or more) nearest logged points.
The main idea here is, analogously to multiple timestepping, to only record particle data when it becomes necessary. The goal is to have sufficient data in a sufficiently compact form such that the entire dynamics of a simulation can be reconstructed within a reasonable error tolerance and reasonable storage requirements.
How? This is where things get interesting. Assume, for the time being, that we are running a simulation on a single node/rank. The log consists of two types of files: a single `data` file, and a set of `index` files. The `data` file is just a chronological dump of particle data. The `index` files, which are generated at fixed intervals, map particle IDs to the last offset in the `data` file where they were dumped.
The particle data is stored in chunks. Each chunk starts with a header containing a `char`, which specifies the level of detail of the information in that chunk, followed by a varint (or just a 64-bit value for simplicity) encoding the negative offset of the previous chunk for the same particle. E.g. if we're going to log a chunk for a particle at `level=1`, and the last chunk describing that particle was written 1234 bytes behind the current end of the log, then we start the chunk by writing "`1, 1234`".
The size of the data after the header depends on the chunk level. The levels may be encoded as follows, with each bit encoding whether the corresponding data is appended or not:

| Level bit | Data | Size (bytes) | Comments |
|---|---|---|---|
| 0 | `float dx[3]` | 12 | Change in position since the previous log. |
| 1 | `double x[3]` | 24 | Absolute particle positions. |
| 2 | `float v[3]` | 12 | Particle velocity. |
| 3 | `float u` | 4 | Particle energy. |
| 4 | `float h` | 4 | Particle smoothing length. |
| 5 | `float rho` | 4 | Particle density. |
| 6 | `float m`, `int64 id`, etc. | ?? | Constant particle data. |
E.g. a level of `13`, in binary `1101`, would contain the change in particle position, the particle velocity, and the particle energy. For compactness, e.g. to save a bit, the first bit could be interpreted as storing the position change if zero, or the absolute position if one. At the start (and end?) of a simulation, all particles would be logged at the highest level.
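Under this encoding, the payload size of a chunk follows directly from the level bitmask. A minimal C sketch, using the sizes from the table above (function and array names are illustrative; bit 6 is left out since the size of the constant data is still open):

```c
#include <stddef.h>
#include <stdint.h>

/* Payload sizes, in bytes, for each level bit, matching the table
   above. If the save-a-bit trick is used, bits 0 and 1 would collapse
   into a single bit choosing between dx and x. */
static const size_t logger_field_size[6] = {
    12, /* bit 0: float dx[3] */
    24, /* bit 1: double x[3] */
    12, /* bit 2: float v[3]  */
    4,  /* bit 3: float u     */
    4,  /* bit 4: float h     */
    4,  /* bit 5: float rho   */
};

/* Compute the payload size of a chunk from its level bitmask. */
size_t logger_chunk_size(uint8_t level) {
  size_t size = 0;
  for (int bit = 0; bit < 6; ++bit)
    if (level & (1u << bit)) size += logger_field_size[bit];
  return size;
}
```

E.g. `logger_chunk_size(13)` (binary `1101`) gives 12 + 12 + 4 = 28 bytes of payload.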
Reading particle data. Given such a format, getting the particle data for any given time point is a simple operation (the interpolation step is sketched after the list):
- Load the last generated index file.
- Get from the index the offset of the last recorded chunk with the particle of interest.
- Read and parse the chunk in the data at the indexed offset.
- Using the previous-chunk offset stored in each chunk header, skip backwards through the file until we have two chunks that bracket the requested time point, one before and one after.
- Interpolate the data at the required time point.
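For the last step, a minimal sketch assuming linear interpolation of a scalar quantity (names are illustrative):

```c
/* Linearly interpolate one logged quantity between the two chunks
   that bracket the requested time t. t0/q0 come from the chunk before
   t, t1/q1 from the chunk after, with the times taken from the
   surrounding time-chunks (see below). */
double logger_interpolate(double t0, double q0, double t1, double q1,
                          double t) {
  const double w = (t - t0) / (t1 - t0);
  return (1.0 - w) * q0 + w * q1;
}
```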
The obvious question here is: What is time? Simulation time is just an offset in the data file, and at each timestep we can write a chunk, a fake particle if you wish, recording the passage of time. The time at which a chunk of data was stored is then the time stored in the next time-chunk in the file.
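Such a time-chunk could be as simple as the following sketch, where the reserved level value and field names are assumptions:

```c
#include <stdint.h>

/* Hypothetical time-chunk: a chunk with a reserved level value whose
   payload is the current simulation time. As described above, data
   chunks are assigned the time stored in the *next* time-chunk that
   appears in the file. */
#define LOGGER_LEVEL_TIMESTAMP 0xFF

struct logger_time_chunk {
  uint8_t level; /* always LOGGER_LEVEL_TIMESTAMP */
  double time;   /* simulation time at this point in the log */
};
```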
Since most people work with snapshots, we could provide a tool that, given a set of `data` and `index` files, extracts the particles in a given volume at a given time. That should be enough to get most people started.
What about MPI? In a distributed simulation, we can generate a `data` file per node/rank. Whenever a particle leaves a given rank, it is stored at the highest level of detail (modulo constants) and removed from that rank's index. Whenever a particle is added to a given rank, it is logged at the highest level of detail and added to that rank's index. Queries work the same way, except that they are run over the files from all ranks.
Implementation details. I think the best way to implement the data logging is to use `mmap`, i.e. a memory-mapped file. Chunks can be allocated in parallel by atomically increasing a pointer to the end of the file (in memory); the actual writing of the data is handled by the OS paging mechanism and is a lot faster than calling `write` on a regular file.
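A minimal C11 sketch of this scheme (names are illustrative; growing the file when the mapping fills up, `msync`, and most error handling are left out):

```c
#include <fcntl.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* A memory-mapped log with an atomic end-of-log pointer. */
struct logger {
  char *data;           /* base address of the mmap'd file */
  atomic_size_t offset; /* current end of the log within the mapping */
};

/* Create the backing file, size it, and map it writable. */
int logger_init(struct logger *log, const char *fname, size_t size) {
  int fd = open(fname, O_CREAT | O_RDWR, 0644);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)size) != 0) { close(fd); return -1; }
  log->data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd); /* the mapping stays valid after closing the fd */
  if (log->data == MAP_FAILED) return -1;
  atomic_init(&log->offset, 0);
  return 0;
}

/* Atomically reserve `size` bytes at the end of the log; threads can
   then write their chunks into the returned region in parallel. */
char *logger_alloc(struct logger *log, size_t size) {
  return log->data + atomic_fetch_add(&log->offset, size);
}
```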
The particle index is just a sorted list of pairs of [`uint64 id`, `size_t offset`], with a special entry for the last time-chunk offset. The only (minor) tricky thing here is keeping track of particles that enter/leave a rank between index dumps.
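For illustration, one possible shape for an index entry, with a comparator for binary search over the sorted array (names are assumptions):

```c
#include <stdint.h>
#include <stdlib.h> /* bsearch, qsort */

/* One index entry: maps a particle ID to the offset of that
   particle's last chunk in the data file. An index file is a sorted
   array of these, plus one special entry for the last time-chunk. */
struct logger_index_entry {
  uint64_t id;
  uint64_t offset;
};

/* Comparator for bsearch/qsort over the entry array. */
static int logger_index_cmp(const void *a, const void *b) {
  const uint64_t ia = ((const struct logger_index_entry *)a)->id;
  const uint64_t ib = ((const struct logger_index_entry *)b)->id;
  return (ia > ib) - (ia < ib);
}
```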
Open questions. Before writing a single line of code, there are a number of questions that have to be addressed. Here are what I think are the most pertinent:
- What is a good strategy for deciding when to dump what values? If we use linear interpolation, then the interpolation error is a function of the quantity's second derivative. We can use that as a trigger for dumping a quantity, similarly to setting particle timesteps (see the error bound sketched after this list).
- Do the proposed levels really make sense? Are there values that we would always want to dump together? Or mutually exclusive values?
- Given a dumping strategy and a set of levels, how much data would we generate for e.g. an Eagle run? What kind of values for the interpolation error would generate the same amount of data as the current snapshot regime?
- Are there any corner cases that do not fit well into this design?
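On the first question: for linear interpolation of a quantity $q$ between two logged values at times $t_0$ and $t_1$, the standard error bound is

$$|q(t) - \hat{q}(t)| \le \frac{(t_1 - t_0)^2}{8} \max_{s \in [t_0, t_1]} |\ddot{q}(s)|,$$

so a natural trigger would be to log $q$ whenever an estimate of $(\Delta t)^2 \, |\ddot{q}| / 8$ exceeds the chosen tolerance, with $\Delta t$ the time since the quantity was last logged.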