Erasure encoded restart files.
Fun idea from Alastair Basden. To protect against the loss of single OSTs, which can happen on the COSMA snap file stores, why not write out the restart files using an erasure encoding technique? This is how RAID devices can operate, but applied at the level of files.
So instead of writing one restart file, we write, say, 6 files and arrange to erasure encode these so that the loss of any one file isn't fatal, as the missing data can be reconstructed from the remaining 5.
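
As a rough illustration of the 6-file idea, here is a minimal sketch assuming the simplest possible scheme: 5 data chunks plus 1 XOR parity chunk, as in RAID 5. The note above does not say which erasure code would be used, so the scheme and the function names (encode, decode, xor_blocks) are assumptions made for the example, and the chunks are held in memory rather than written to files.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def encode(data, n_data=5):
    """Split data into n_data equal chunks and append one XOR parity chunk."""
    chunk_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(chunk_len * n_data, b"\0")  # zero-pad the last chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    return chunks + [xor_blocks(chunks)]


def decode(chunks, orig_len):
    """Rebuild the data from the chunks, at most one of which may be None."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost chunk")
    if missing:
        chunks = list(chunks)
        # The XOR of all surviving chunks equals the lost one.
        chunks[missing[0]] = xor_blocks([c for c in chunks if c is not None])
    # Drop the parity chunk and the zero padding.
    return b"".join(chunks[:-1])[:orig_len]
```

Each of the 6 chunks would go to its own file, and the original length would need to be stored somewhere (a small header in every file, for instance) so the padding can be removed on restart. A single parity chunk only survives the loss of one file; tolerating more would need something like a Reed-Solomon code.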
There could be code out there that does this already, but clearly we would need a lot of effort to make sure this worked and didn't impact performance too much. With 5 data files and 1 redundant file it would also have a 20% storage overhead, which is much better than the 100% of a simple duplicate.
Also note we would need to make sure each file was written to a different OST.
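
One possible way to get that placement, assuming the file store is Lustre and the `lfs setstripe` tool is available, is to pre-create each file with a stripe count of 1 and an explicit starting OST index. The sketch below only builds the command lines and does not run them; the helper name and the file naming scheme (base name plus a numeric suffix) are hypothetical.

```python
def setstripe_commands(basename, ost_indices):
    """Return one `lfs setstripe` command per chunk file, each pinned to one OST."""
    if len(set(ost_indices)) != len(ost_indices):
        raise ValueError("each chunk file needs its own OST")
    return [
        # A stripe count of 1 keeps the whole file on the single OST given
        # by the stripe index.
        ["lfs", "setstripe", "--stripe-count", "1",
         "--stripe-index", str(ost), f"{basename}.{i}"]
        for i, ost in enumerate(ost_indices)
    ]
```

The commands could then be passed to `subprocess.run` before the restart files are opened for writing. Which OST indices are valid and healthy would have to come from the system itself, so they are left as an input here.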