Setup Environment
First, make sure that everything works in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama. This confirms that you have resolved all environment issues and can start converting the Hugging Face checkpoint into a ZeRO-enabled checkpoint.
Checkpoint Conversion
The simplest idea is to run the script hf2megads_weight_converter.py with pipeline parallelism disabled to get a DeepSpeed ZeRO checkpoint.
Ah! But this cannot be done with the script from https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama.
When you try it, you will get an error (see Megatron-DeepSpeed/tools/hf2megads_weight_converter.py, lines 288 to 291 in 3afd267).
Then you may think the universal_checkpointing technique can help you achieve such a conversion.
Ah! You wish.
universal_checkpointing can convert between ZeRO1/2/3 checkpoints with different world sizes, and between TP/PP/ZeRO1 checkpoints with different parallel sizes. But it cannot convert between TP/PP/ZeRO1 and ZeRO2/3.
So there is only one way left: work out a ZeRO2/3 checkpoint conversion method based on the script hf2megads_weight_converter.py.
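The compatibility rule above can be written down as a small check. This is only a literal reading of that paragraph, not what DeepSpeed actually validates; the `CkptLayout` class and the `universal_ckpt_can_convert` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CkptLayout:
    """Hypothetical description of how a checkpoint was sharded."""
    tp: int          # tensor-parallel size
    pp: int          # pipeline-parallel size
    zero_stage: int  # ZeRO stage, 0 to 3


def universal_ckpt_can_convert(src: CkptLayout, dst: CkptLayout) -> bool:
    """Return False for the one case the post says is unsupported:
    a TP/PP checkpoint on one side and a ZeRO2/3 checkpoint on the other."""
    def is_model_parallel(layout: CkptLayout) -> bool:
        return layout.tp > 1 or layout.pp > 1

    def is_zero23(layout: CkptLayout) -> bool:
        return layout.zero_stage in (2, 3)

    crosses_families = (
        (is_model_parallel(src) and is_zero23(dst))
        or (is_zero23(src) and is_model_parallel(dst))
    )
    return not crosses_families


# A converter output saved with TP/PP and ZeRO1, targeting plain ZeRO2.
megatron_style = CkptLayout(tp=2, pp=2, zero_stage=1)
zero2_target = CkptLayout(tp=1, pp=1, zero_stage=2)
print(universal_ckpt_can_convert(megatron_style, zero2_target))
```

Under this reading, the TP/PP checkpoint produced by the converter is exactly the case that cannot be turned into a ZeRO2/3 checkpoint, which is why the conversion script itself has to change.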
Finetune script
After getting a ZeRO checkpoint, everything else is quite easy.
But since this tutorial https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/finetune_hf_llama does not expect you to finetune llama using ZeRO without pipeline parallelism, a little more effort is still needed to get there.
For the detailed modifications, please refer to fix-zero-load.
With those changes, it should work well.
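For the finetuning run, the DeepSpeed side of the setup comes down to a config file that selects the ZeRO stage. The following is a minimal sketch that writes such a config; the batch sizes, the bf16 choice, and the helper name `build_ds_config` are placeholder assumptions, not values taken from the tutorial or from fix-zero-load. How the tutorial's launch script passes this file to the trainer is not covered here.

```python
import json


def build_ds_config(zero_stage: int = 2,
                    micro_batch_size: int = 1,
                    grad_accum_steps: int = 8) -> dict:
    """Build a small DeepSpeed config dict (hypothetical helper, placeholder values)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"ZeRO stage must be 0, 1, 2 or 3, got {zero_stage}")
    return {
        # Per-GPU micro batch and accumulation; DeepSpeed derives the global batch.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # The ZeRO stage is the setting this post is about.
        "zero_optimization": {"stage": zero_stage},
        # Assumption: bf16 training; switch to an "fp16" section if your hardware needs it.
        "bf16": {"enabled": True},
    }


if __name__ == "__main__":
    # Write the config next to the launch script (the file name is arbitrary).
    with open("ds_config_zero2.json", "w") as f:
        json.dump(build_ds_config(zero_stage=2), f, indent=2)
```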
In my case, another problem arises when I specify --untie-embeddings-and-output-weights in the script. The whole program gets stuck in an NCCL all-gather operation. Surprisingly, it gets stuck at a random iteration, making reproduction quite difficult. If you encounter the same situation, try modifying the code in language_model.py to forcefully disable tensor parallel (TP) linear.