Save checkpoint to temporary directory to handle partial saves during failures #35580
base: main
Conversation
Thanks for the PR! Left a comment. Can you double check @muellerzr?
src/transformers/trainer_utils.py
Outdated
for checkpoint in sorted(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]), reverse=True):
    if is_valid_checkpoint_dir(checkpoint):
        break
return os.path.join(folder, checkpoint)
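For readers of this diff, a minimal sketch of what an is_valid_checkpoint_dir helper along these lines could look like; the exact file names (config.json, trainer_state.json, and the weight file candidates) are assumptions based on the PR description, not necessarily the final implementation:

```python
import os

# Hypothetical helper: the file names below are assumptions taken from the PR
# description (at least one weights file, plus config and trainer state).
WEIGHT_FILE_CANDIDATES = ("model.safetensors", "pytorch_model.bin")
REQUIRED_FILES = ("config.json", "trainer_state.json")


def is_valid_checkpoint_dir(checkpoint_dir: str) -> bool:
    """Return True if the checkpoint folder contains all files needed to resume."""
    has_weights = any(
        os.path.isfile(os.path.join(checkpoint_dir, name)) for name in WEIGHT_FILE_CANDIDATES
    )
    has_required = all(
        os.path.isfile(os.path.join(checkpoint_dir, name)) for name in REQUIRED_FILES
    )
    return has_weights and has_required
```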
I think it would be good to display a warning to the user, specifying that we decided to skip a particular checkpoint because it is missing files.
Checking for the existence of files is only a partial solution to this problem: you can end up with partially written files if a crash happens during the save. The needed fixes include writing the checkpoint to a temporary location and only moving it into place once the save has fully completed.
@rwightman thanks for your comments.
You're right, but I believe the underlying write-to-file calls usually handle this issue of partial files on a crash. For instance, safetensors.save uses numpy.save, which (as this SO issue mentions) doesn't create the file until the write succeeds. Even in my own runs, I haven't faced that issue. It could happen in edge cases, but I believe it is much less likely than the missing-files case.
I initially planned to fix it that way (a try/except around the load, falling back to another checkpoint directory), but unfortunately, once the resume_from_checkpoint directory is selected it is used in too many places to easily catch the error and redo everything that came before. Another problem is that the user can also pass that variable themselves (get_last_checkpoint is only called for auto-resume; in other cases the variable is set to whatever the user passes), in which case we don't want to change the directory and should probably raise an error (maybe with a better message than FileNotFoundError, though).
This is a good suggestion and could also solve this problem. At a quick glance, all the saving happens in one place.
This solution from @rwightman seems to be the most robust one. Would you like to try to implement this, @SilverSoldier?
Sure, let me change the PR to implement the save-to-tmp-dir approach.
@SilverSoldier in these cases, the usual approach is to do the writes to a distinct temp name in the same location as the final destination. So if you normally save checkpoints to the final checkpoint directory, you first write everything to a temporary name next to it and rename it to the final name once the save succeeds. For most filesystems, renaming a file/folder on the same drive is an atomic operation with very low risk of failure. I looked at the numpy save and it's a chunked write; I'm pretty sure you can cause partial write failures in many situations.
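A minimal sketch of that staged-save idea, assuming a save_fn callback that writes all checkpoint files into a given directory; the function and directory names are illustrative rather than the actual Trainer code:

```python
import os


def save_checkpoint_atomically(save_fn, output_dir: str, step: int) -> str:
    """Write a checkpoint to a staging directory, then rename it into place."""
    final_dir = os.path.join(output_dir, f"checkpoint-{step}")
    staging_dir = os.path.join(output_dir, f"tmp-checkpoint-{step}")

    os.makedirs(staging_dir, exist_ok=True)
    save_fn(staging_dir)  # write model weights, optimizer and trainer state here

    # Renaming within the same filesystem is atomic on most platforms, so a crash
    # during the save leaves only the tmp-* directory behind and never a partially
    # written checkpoint-{step} directory.
    os.rename(staging_dir, final_dir)
    return final_dir
```

Any leftover tmp-checkpoint-* directory from an earlier crash can then be ignored or cleaned up on the next run, since only fully written checkpoints ever receive the final name.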
Force-pushed from 9d0410d to 8f8ead2.
I have changed the PR to the save-to-tmp approach using @rwightman's idea (thanks!).
@SunMarc could you please review the PR now?
Thanks for iterating! Left a few comments. Can you double check @muellerzr?
Force-pushed from 8f8ead2 to 81175b6.
Force-pushed from 81175b6 to 3c9677f.
Make sure to fix the CI.
Since partial/missing files due to failures throw an error during load
Force-pushed from 3c9677f to f7a972a.
Fixed the code style issues.
LGTM!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
When auto-resuming from a checkpoint, the folder with the highest step number is currently picked as the checkpoint to resume from. If some files are missing (like the model weights, config, or trainer_state), this results in a FileNotFoundError. This PR picks the latest checkpoint folder that contains all required files (which should ideally be the last or second-to-last one), allowing training to resume instead of throwing an error that requires manually removing files.
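For context, the auto-resume path in question typically looks roughly like this (a sketch; training_args and trainer are assumed to be set up elsewhere):

```python
from transformers.trainer_utils import get_last_checkpoint

# Pick the newest checkpoint-* folder in the output directory (None if there is none).
last_checkpoint = get_last_checkpoint(training_args.output_dir)

# Before this PR, a checkpoint folder with missing files would raise FileNotFoundError here.
trainer.train(resume_from_checkpoint=last_checkpoint)
```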
Fixes #35782
Currently, I test for at least one of the model weight files, while always requiring config and trainer_state. From what I understood, the others (like the optimizer state) are optional, but please let me know if I missed something.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Trainer: @muellerzr and @SunMarc