Save checkpoint to temporary directory to handle partial saves during failures #35580

Open · wants to merge 2 commits into main
Conversation

@SilverSoldier (Contributor) commented Jan 9, 2025

What does this PR do?

When auto-resuming from a checkpoint, the checkpoint folder with the highest step number is currently picked as the folder to resume from. If some files are missing from it (such as the model weights, config, or trainer_state), this results in a FileNotFoundError. This PR instead picks the latest checkpoint folder (which should ideally be the last or second-to-last one) that contains all required files, allowing training to resume instead of throwing an error that requires manually removing files.

Fixes #35782

Currently, I test for at least one of the model weight files, and always require the config and trainer_state. From what I understood, the others (like the optimizer state) are optional, but please let me know if I missed something.
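
(Editorial illustration, not part of the PR: a minimal sketch of the kind of validity check described above. The file names are assumptions based on the usual Trainer checkpoint layout; the actual logic lives in the PR diff.)

    import os
    import re

    _re_checkpoint = re.compile(r"^checkpoint-(\d+)$")

    # Assumed file names for a typical Trainer checkpoint directory.
    REQUIRED_FILES = ["config.json", "trainer_state.json"]
    WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]

    def is_valid_checkpoint_dir(path):
        # A checkpoint is usable if it has the config, the trainer state,
        # and at least one of the model weight files.
        has_required = all(os.path.isfile(os.path.join(path, f)) for f in REQUIRED_FILES)
        has_weights = any(os.path.isfile(os.path.join(path, f)) for f in WEIGHT_FILES)
        return has_required and has_weights

    def get_last_valid_checkpoint(folder):
        # Return the newest checkpoint directory that passes the validity check.
        checkpoints = [
            d for d in os.listdir(folder)
            if _re_checkpoint.search(d) and os.path.isdir(os.path.join(folder, d))
        ]
        for checkpoint in sorted(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]), reverse=True):
            if is_valid_checkpoint_dir(os.path.join(folder, checkpoint)):
                return os.path.join(folder, checkpoint)
        return None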

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Trainer: @muellerzr and @SunMarc

@SunMarc (Member) left a comment


Thanks for the PR! Left a comment. Can you double check, @muellerzr?

Comment on lines 241 to 244
for checkpoint in sorted(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]), reverse=True):
    if is_valid_checkpoint_dir(checkpoint):
        break
return os.path.join(folder, checkpoint)

I think it would be good to display a warning to the user, specifying that we decided to skip a particular checkpoint because it is missing files.

@rwightman (Contributor)

Checking for existence of files is only a partial solution to this problem; you can have partially written files after a crash during save.

The needed fixes include:

  • error handling on load: detect a failure during any part of the load (a crash while loading a partial/corrupted checkpoint) and execute logic to find and try the preceding one
  • improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

@SilverSoldier (Contributor, Author)

@rwightman thanks for your comments.

> Checking for existence of files is only a partial solution to this problem; you can have partially written files after a crash during save.

You're right, but I believe the underlying write-to-file calls usually handle this partial-file issue when crashing. For instance, safetensors.save uses numpy.save, which this SO issue mentions doesn't create the file until it succeeds. Even in my own runs, I hadn't faced that issue. It could happen in edge cases, but I believe it is much less likely than the missing-files case.

> error handling on load: detect a failure during any part of the load (a crash while loading a partial/corrupted checkpoint) and execute logic to find and try the preceding one

I initially planned to fix it that way (a try/except around the load, then changing the dir), but unfortunately, once selected, the resume_from_checkpoint dir is used in too many places to easily catch the error and then replace everything that came before as well. Another problem is that the user could also pass that variable themselves (get_last_checkpoint is only called in the auto-resume case; otherwise the variable is set to whatever the user passes), in which case we don't want to change the directory and should probably raise an error (maybe with a better error message than FileNotFound, though).

> improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

This is a good suggestion and could also solve this problem. At a quick glance, all the saving happens in _save_checkpoint, so writing everything to a tmp dir and finally moving it into place should also work (with a small possibility of a crash in the middle of the move).

@SunMarc (Member) commented Jan 21, 2025

> improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

This solution from @rwightman seems to be the most robust one. Would you like to try to implement it, @SilverSoldier?

@SilverSoldier (Contributor, Author)

Sure, let me change the PR to implement the save-to-tmp-dir approach.

@rwightman (Contributor) commented Jan 22, 2025

@SilverSoldier in these cases, the usual approach is to do the writes to a distinct temp name in the same location as the final destination. So if you normally save checkpoints to /mycheckpoints/checkpoint_0 etc., then while writing you'd use a temp path like /mycheckpoints/_temp_checkpoint; when that is successful you can then do an os.rename of _temp_checkpoint -> checkpoint_0, etc.

For most filesystems, renaming a file/folder on the same drive is an atomic operation with very low risk of failure.

I looked at the numpy save and it's a chunked write; I'm pretty sure you can get partial-write failures in many situations.
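
(Editorial illustration, not from the PR: a minimal sketch of this write-to-temp-then-rename pattern for a single file. The function name and parameters are made up; os.replace gives an atomic swap when source and destination live on the same filesystem.)

    import os
    import tempfile

    def atomic_write_bytes(final_path, data):
        # Write to a temp file in the same directory as the destination so the
        # final rename stays on one filesystem (and is therefore atomic on POSIX).
        dir_name = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(prefix="_temp_", dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())          # make sure the bytes hit the disk
            os.replace(tmp_path, final_path)  # atomic swap to the final name
        except Exception:
            os.unlink(tmp_path)               # never leave a half-written temp file behind
            raise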

@SilverSoldier (Contributor, Author)

I have changed to the save-to-tmp approach using @rwightman's idea (thanks!).

  1. Using TemporaryDirectory in the same checkpoint directory, with the name prefix tmp-checkpoint-. Because of the execution context (the with block), if saving fails somewhere in the middle, the tmp directory is cleaned up. Finally, it is renamed to the correct name (see the sketch after this list).
  2. There is one corner case: if the last epoch is already saved and save_checkpoint is called again at the end of training, we get a "File or Directory already exists" error for the checkpoint-50 (or whatever) dir (I got different errors when I tried different setups). In that case, we need to remove the old one and rename the new one. I think both should be exactly the same (checked for one run). If they are 100% the same, we could even catch it before checkpointing (which would save some time as well).
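
(Editorial sketch, not the PR's actual code: roughly the flow described in points 1 and 2 above. The output_dir and checkpoint_folder names and the save_all_files callback are placeholders standing in for the real saving logic in _save_checkpoint.)

    import os
    import shutil
    import tempfile

    def save_checkpoint_atomically(output_dir, checkpoint_folder, save_all_files):
        final_path = os.path.join(output_dir, checkpoint_folder)
        # The temp dir lives next to the final destination so the rename stays on
        # the same filesystem; it is auto-deleted if anything below raises.
        with tempfile.TemporaryDirectory(prefix="tmp-checkpoint-", dir=output_dir) as tmp_dir:
            save_all_files(tmp_dir)  # model weights, config, trainer_state, optimizer, ...
            # Corner case: the checkpoint for this step may already exist
            # (e.g. saved again at the end of training); remove it first.
            if os.path.isdir(final_path):
                shutil.rmtree(final_path)
            os.rename(tmp_dir, final_path)
            # TemporaryDirectory's cleanup on exit ignores the now-missing
            # (already renamed) temp directory, so nothing is left behind.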

@SunMarc could you please review the PR now?

@SunMarc (Member) left a comment


Thanks for iterating! Left a few comments. Can you double check, @muellerzr?

src/transformers/trainer.py: 3 review comments (outdated, resolved)
@SunMarc requested a review from muellerzr on January 22, 2025 13:04
@SilverSoldier changed the title from "Get latest + complete checkpoint directory when auto-resume from checkpoint" to "Save checkpoint to temporary directory to handle partial saves during failures" on Jan 23, 2025
@SunMarc (Member) commented Jan 23, 2025

Make sure to fix the CI with make style

Since partial/missing files due to failures throw error during load
@SilverSoldier (Contributor, Author)

Fixed the code style issues.
I believe the failing tests after merging main (qwen2_5_vl.test_modeling_qwen2_5_vl.Qwen2_5_VLModelTest) are unrelated.

@SilverSoldier requested a review from SunMarc on January 24, 2025 09:42
@SunMarc (Member) left a comment


LGTM!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Successfully merging this pull request may close these issues.

Auto-resume from checkpoint throws error if last checkpoint is incomplete