Abstract: Checkpointing is one of the fundamental techniques to resume training while system fails. It has been generally used in various domains, such as high-performance computing (HPC) and machine ...