ilab model train --pipeline simple (v0.21.0) #2666
Comments
Hey @apcameron, the simple pipeline should be considered legacy functionality as well, which is why it only works with the legacy dataset. If you want to use CUDA, please try the accelerated pipeline, or feel free to open an issue to track adding CUDA support to the full pipeline!
@cdoern My GPU only has 8GB of VRAM, and as far as I know the simple pipeline is the only one with the `--4-bit-quant` option. The accelerated pipeline gives me an out-of-memory error, since as far as I know it does not support small GPUs, Unified Memory for CUDA, or offloading the extra layers to the CPU.
@apcameron the accelerated pipeline supports LoRA and quantization, I believe. If you check out the training section of your config, you can set `lora_rank` to anything but 0 (usually 2), set `lora_quantize_dtype`, and enable DeepSpeed CPU offloading; see the sketch below.
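A minimal sketch of what that `train` section could look like. The key names here are assumptions based on a v0.21-era `config.yaml`; run `ilab config show` to confirm the exact keys in your version:

```yaml
# Excerpt of ~/.config/instructlab/config.yaml -- key names are assumptions,
# verify against `ilab config show` for your installed version.
train:
  pipeline: accelerated
  device: cuda
  lora_rank: 2                            # any non-zero rank enables LoRA
  lora_quantize_dtype: nf4                # 4-bit quantization of the base weights
  deepspeed_cpu_offload_optimizer: true   # push optimizer state to CPU RAM
```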
@cdoern When I try this I get the following error:

```
[2024-11-17 23:03:45,906] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 2024-11-17 23:03:52,432 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-11-17 23:03:52,738 datasets:59: PyTorch version 2.4.0+cu121 available.
--- Logging error ---
Traceback (most recent call last):
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 198, in accelerated_train
    run_training(train_args=train_args, torch_args=torch_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 663, in run_training
    check_valid_train_args(train_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/utils.py", line 88, in check_valid_train_args
    raise ValueError(
ValueError: `accelerate_full_state_at_epoch` is not currently supported when training LoRA models.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/data/instructlab/env/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 201, in accelerated_train
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (ValueError('`accelerate_full_state_at_epoch` is not currently supported when training LoRA models.'),)
Accelerated Training failed with 1
```
@apcameron so the error here is actually an incompatibility between our checkpointing system and LoRA. (You will also need …) @RobotSail I thought we fixed this in 0.21? @apcameron can you `pip show instructlab-training`?
```
Name: instructlab-training
Version: 0.6.1
Summary: Training Library
Home-page:
Author:
Author-email: InstructLab <[email protected]>
License: Apache-2.0
Location: /data/instructlab/env/lib/python3.11/site-packages
Requires: aiofiles, datasets, instructlab-dolomite, numba, numpy, packaging, peft, py-cpuinfo, pydantic, pyyaml, rich, torch, transformers, trl, wheel
Required-by: instructlab
```
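If the fix mentioned above landed in a newer release of the training library, upgrading it inside the same virtualenv would be the obvious first thing to try. Whether 0.6.1 actually predates such a fix is an assumption here, and upgrading it independently of `instructlab` may conflict with its pinned requirements:

```shell
# Run inside the virtualenv that ilab uses (/data/instructlab/env in this thread).
pip install --upgrade instructlab-training
pip show instructlab-training  # confirm the resulting version
```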
can you use …
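Until that incompatibility is resolved in the CLI, one conceivable workaround is to bypass `ilab` and call the training library directly, where `accelerate_full_state_at_epoch` can be disabled. This is only a sketch: the field names follow the instructlab-training ~0.6 README, the hyperparameters and paths are placeholders, and whether this is a supported path is an assumption:

```python
# Hedged sketch: call instructlab-training directly so that LoRA can be combined
# with accelerate_full_state_at_epoch=False (the flag the ValueError complains about).
# Field names follow the instructlab-training ~0.6 README; values are placeholders.
from instructlab.training import LoraOptions, TorchrunArgs, TrainingArgs, run_training

train_args = TrainingArgs(
    model_path="~/.cache/instructlab/models/instructlab/granite-7b-lab",
    data_path="~/.local/share/instructlab/datasets/<your-train-file>.jsonl",  # use your real file
    ckpt_output_dir="checkpoints",
    data_output_dir="data-output",
    max_seq_len=2048,
    max_batch_len=5000,
    num_epochs=1,
    effective_batch_size=16,
    save_samples=0,
    learning_rate=1e-4,
    warmup_steps=25,
    is_padding_free=False,
    # LoRA + 4-bit quantization, mirroring the CLI flags used in this thread:
    lora=LoraOptions(rank=2, alpha=32, dropout=0.1, quantize_data_type="nf4"),
    # Disable the full-state checkpointing that check_valid_train_args rejects for LoRA:
    accelerate_full_state_at_epoch=False,
)

torch_args = TorchrunArgs(
    nnodes=1, nproc_per_node=1, node_rank=0,
    rdzv_id=123, rdzv_endpoint="127.0.0.1:12345",
)

run_training(torch_args=torch_args, train_args=train_args)
```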
@cdoern I get the same error:

```shell
ilab model train --pipeline accelerated --device cuda \
  --data-path ~/.local/share/instructlab/datasets/train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl \
  --optimize-memory \
  --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab \
  --4-bit-quant --lora-alpha 32 --lora-dropout 0.1 --lora-rank 2 \
  --lora-quantize-dtype nf4 --fsdp-cpu-offload-optimizer true \
  --gpus 1 --distributed-backend fsdp
```

```
[2024-11-18 14:19:38,902] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 2024-11-18 14:20:26,901 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-11-18 14:20:29,012 datasets:59: PyTorch version 2.4.0+cu121 available.
--- Logging error ---
Traceback (most recent call last):
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 198, in accelerated_train
    run_training(train_args=train_args, torch_args=torch_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 663, in run_training
    check_valid_train_args(train_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/utils.py", line 88, in check_valid_train_args
    raise ValueError(
ValueError: `accelerate_full_state_at_epoch` is not currently supported when training LoRA models.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/data/instructlab/env/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 201, in accelerated_train
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (ValueError('`accelerate_full_state_at_epoch` is not currently supported when training LoRA models.'),)
Accelerated Training failed with 1
```
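Separately, the `--- Logging error ---` noise is a secondary bug in the CLI's error handler, not in the training run itself: `logger.error("Failed during training loop: ", e)` passes the exception as a %-format argument without a placeholder, which is what raises the `TypeError`. A minimal sketch of the corrected call:

```python
import logging

logger = logging.getLogger(__name__)

try:
    raise ValueError("`accelerate_full_state_at_epoch` is not currently supported when training LoRA models.")
except ValueError as e:
    # Buggy form from accelerated_train.py line 201: no %s placeholder for `e`,
    # so logging fails with "not all arguments converted during string formatting".
    #   logger.error("Failed during training loop: ", e)
    # Fixed form: let the logging module interpolate the exception into the message.
    logger.error("Failed during training loop: %s", e)
```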
I am on 0.21.0 and I get the same error. I wanted to train a better version of the model with the full OR accelerated pipeline. Full doesn't work on GPU, and accelerated fails with an out-of-memory error on one L40 GPU (48 GB VRAM). What's the recommendation? @apcameron @cdoern
Original report: `ilab model train --pipeline simple` is looking for the legacy data files and not the latest ones.
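A hedged illustration of the failure mode. The dataset filename below is the one from this thread; whether passing it explicitly via `--data-path` satisfies the simple pipeline, or fails on the dataset format rather than on discovery, is an assumption:

```shell
# New-style dataset produced by `ilab data generate` (filename from this thread):
ls ~/.local/share/instructlab/datasets/
# train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl ...

# The simple pipeline's auto-discovery expects legacy-format files, so even
# pointing it at a new-style file explicitly may still fail on the format:
ilab model train --pipeline simple --device cuda \
  --data-path ~/.local/share/instructlab/datasets/train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl
```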