Save checkpoint to temporary directory to handle partial saves during failures #35580

Open · wants to merge 2 commits into main
Conversation

@SilverSoldier (Contributor) commented Jan 9, 2025

What does this PR do?

When auto-resuming from a checkpoint, the checkpoint folder with the highest step number is currently picked as the folder to resume from. If some files are missing from it (such as the model weights, config, or trainer_state), this results in a FileNotFoundError. This PR instead picks the latest checkpoint folder (which should ideally be the last or second-to-last one) that contains all required files, allowing training to resume instead of throwing an error that requires manually removing files.

Fixes #35782

Currently, I test for at least one of the model weight files, and always require the config and trainer_state. From what I understood, the others (like the optimizer state) are optional, but please let me know if I missed something.
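
(Editorial illustration, not part of the PR: a minimal sketch of the kind of validity check described above. The file names are assumptions based on the usual Trainer checkpoint layout; the actual logic lives in the PR diff.)

    import os
    import re

    _re_checkpoint = re.compile(r"^checkpoint-(\d+)$")

    # Assumed file names for a typical Trainer checkpoint directory.
    REQUIRED_FILES = ["config.json", "trainer_state.json"]
    WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]

    def is_valid_checkpoint_dir(path):
        # A checkpoint is usable if it has the config, the trainer state,
        # and at least one of the model weight files.
        has_required = all(os.path.isfile(os.path.join(path, f)) for f in REQUIRED_FILES)
        has_weights = any(os.path.isfile(os.path.join(path, f)) for f in WEIGHT_FILES)
        return has_required and has_weights

    def get_last_valid_checkpoint(folder):
        # Return the newest checkpoint directory that passes the validity check.
        checkpoints = [
            d for d in os.listdir(folder)
            if _re_checkpoint.search(d) and os.path.isdir(os.path.join(folder, d))
        ]
        for checkpoint in sorted(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]), reverse=True):
            if is_valid_checkpoint_dir(os.path.join(folder, checkpoint)):
                return os.path.join(folder, checkpoint)
        return None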

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Trainer: @muellerzr and @SunMarc

@SunMarc (Member) left a comment


Thanks for the PR! Left a comment. Can you double check, @muellerzr?

Comment on lines 241 to 244
for checkpoint in sorted(checkpoints, key=lambda x: int(_re_checkpoint.search(x).groups()[0]), reverse=True):
    if is_valid_checkpoint_dir(checkpoint):
        break
return os.path.join(folder, checkpoint)

I think it would be good to display a warning to the user, specifying that we decided to skip a particular checkpoint because it is missing files.

@rwightman (Contributor)

Checking for existence of files is only a partial solution to this problem; you can have partially written files after a crash during save.

The needed fixes include:

  • error handling on load: detect a failure during any part of the load (a crash while loading a partial/corrupted checkpoint) and execute logic to find and try the preceding one
  • improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

@SilverSoldier (Contributor, Author)

@rwightman thanks for your comments.

> Checking for existence of files is only a partial solution to this problem; you can have partially written files after a crash during save.

You're right, but I believe the underlying write-to-file calls usually handle this partial-file issue when crashing. For instance, safetensors.save uses numpy.save, which this SO issue mentions doesn't create the file until it succeeds. Even in my own runs, I hadn't faced that issue. It could happen in edge cases, but I believe it is much less likely than the missing-files case.

> error handling on load: detect a failure during any part of the load (a crash while loading a partial/corrupted checkpoint) and execute logic to find and try the preceding one

I initially planned to fix it that way (a try/except around the load, then changing the dir), but unfortunately, once selected, the resume_from_checkpoint dir is used in too many places to easily catch the error and then replace everything that came before as well. Another problem is that the user could also pass that variable themselves (get_last_checkpoint is only called in the auto-resume case; otherwise the variable is set to whatever the user passes), in which case we don't want to change the directory and should probably raise an error (maybe with a better error message than FileNotFound, though).

> improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

This is a good suggestion and could also solve this problem. At a quick glance, all the saving happens in _save_checkpoint, so writing everything to a tmp dir and finally moving it into place should also work (with a small possibility of a crash in the middle of the move).

@SunMarc (Member) commented Jan 21, 2025

> improved checkpoint writing: write to temporary files (or folders in this case) and swap to the correct, final name(s) once all writes have successfully finished. This is a common practice when writing checkpoints / restore states

This solution from @rwightman seems to be the most robust one. Would you like to try to implement it, @SilverSoldier?

@SilverSoldier (Contributor, Author)

Sure, let me change the PR to implement the save-to-tmp-dir approach.

@rwightman (Contributor) commented Jan 22, 2025

@SilverSoldier in these cases, the usual approach is to do the writes to a distinct temp name in the same location as the final destination. So if you normally save checkpoints to /mycheckpoints/checkpoint_0 etc., then while writing you'd use a temp path like /mycheckpoints/_temp_checkpoint; when that is successful you can then do an os.rename of _temp_checkpoint -> checkpoint_0, etc.

For most filesystems, renaming a file/folder on the same drive is an atomic operation with very low risk of failure.

I looked at the numpy save and it's a chunked write; I'm pretty sure you can get partial-write failures in many situations.
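
(Editorial illustration, not from the PR: a minimal sketch of this write-to-temp-then-rename pattern for a single file. The function name and parameters are made up; os.replace gives an atomic swap when source and destination live on the same filesystem.)

    import os
    import tempfile

    def atomic_write_bytes(final_path, data):
        # Write to a temp file in the same directory as the destination so the
        # final rename stays on one filesystem (and is therefore atomic on POSIX).
        dir_name = os.path.dirname(final_path) or "."
        fd, tmp_path = tempfile.mkstemp(prefix="_temp_", dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())          # make sure the bytes hit the disk
            os.replace(tmp_path, final_path)  # atomic swap to the final name
        except Exception:
            os.unlink(tmp_path)               # never leave a half-written temp file behind
            raise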

@SilverSoldier (Contributor, Author)

I have changed to the save-to-tmp approach using @rwightman's idea (thanks!).

  1. Using TemporaryDirectory in the same checkpoint directory, with the name prefix tmp-checkpoint-. Because of the execution context (the with block), if saving fails somewhere in the middle, the tmp directory is cleaned up. Finally, it is renamed to the correct name (see the sketch after this list).
  2. There is one corner case: if the last epoch is already saved and save_checkpoint is called again at the end of training, we get a "File or Directory already exists" error for the checkpoint-50 (or whatever) dir (I got different errors when I tried different setups). In that case, we need to remove the old one and rename the new one. I think both should be exactly the same (checked for one run). If they are 100% the same, we could even catch it before checkpointing (which would save some time as well).
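
(Editorial sketch, not the PR's actual code: roughly the flow described in points 1 and 2 above. The output_dir and checkpoint_folder names and the save_all_files callback are placeholders standing in for the real saving logic in _save_checkpoint.)

    import os
    import shutil
    import tempfile

    def save_checkpoint_atomically(output_dir, checkpoint_folder, save_all_files):
        final_path = os.path.join(output_dir, checkpoint_folder)
        # The temp dir lives next to the final destination so the rename stays on
        # the same filesystem; it is auto-deleted if anything below raises.
        with tempfile.TemporaryDirectory(prefix="tmp-checkpoint-", dir=output_dir) as tmp_dir:
            save_all_files(tmp_dir)  # model weights, config, trainer_state, optimizer, ...
            # Corner case: the checkpoint for this step may already exist
            # (e.g. saved again at the end of training); remove it first.
            if os.path.isdir(final_path):
                shutil.rmtree(final_path)
            os.rename(tmp_dir, final_path)
            # TemporaryDirectory's cleanup on exit ignores the now-missing
            # (already renamed) temp directory, so nothing is left behind.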

@SunMarc could you please review the PR now?

@SunMarc (Member) left a comment


Thanks for iterating! Left a few comments. Can you double check, @muellerzr?

src/transformers/trainer.py: 3 review comments (outdated, resolved)
@SunMarc requested a review from muellerzr on January 22, 2025 13:04
@SilverSoldier changed the title from "Get latest + complete checkpoint directory when auto-resume from checkpoint" to "Save checkpoint to temporary directory to handle partial saves during failures" on Jan 23, 2025
@SunMarc (Member) commented Jan 23, 2025

Make sure to fix the CI with make style

Since partial/missing files due to failures throw error during load
@SilverSoldier (Contributor, Author)

Fixed the code style issues.
I believe the failing tests after merging main (qwen2_5_vl.test_modeling_qwen2_5_vl.Qwen2_5_VLModelTest) are unrelated.

@SilverSoldier requested a review from SunMarc on January 24, 2025 09:42
@SunMarc (Member) left a comment


LGTM!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Successfully merging this pull request may close these issues.

Auto-resume from checkpoint throws error if last checkpoint is incomplete