Best Practices: Checkpointing Preemptible Training Workloads
Where to Save Checkpoints
runai tensorflow submit train-with-checkpoints -i
tensorflow/tensorflow:1.14.0-gpu-py3 --host-path
mount=/mnt/nfs_share/john, path=/mydir -g 1 --working-dir /mydir
--command -- ./startup.shWhen to Save Checkpoints
Save Periodically
checkpoints_file = "weights.best.hdf5"
checkpoint = ModelCheckpoint(checkpoints_file, monitor='val_acc', verbose=1,
save_best_only=True, mode='max')Save on Exit Signal
Grace Period for Preemption
Resuming with Saved Checkpoints
Sample Code
Last updated