
ilab model train --pipeline simple (v0.21.0) #2666

Open
apcameron opened this issue Nov 16, 2024 · 9 comments
Labels
bug (Something isn't working), tech-debt (Issue or PR pertaining to technical debt)

Comments

@apcameron

ilab model train --pipeline simple looks for the legacy data files (train_gen.jsonl / test_gen.jsonl) instead of the latest generated dataset.

ilab model train --pipeline simple --device cuda --data-path ~/.local/share/instructlab/datasets/ --optimize-memory  --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab --4-bit-quant
INFO 2024-11-16 22:15:29,760 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-11-16 22:16:06,758 datasets:59: PyTorch version 2.4.0+cu121 available.
LINUX_TRAIN.PY: NUM EPOCHS IS:  10
LINUX_TRAIN.PY: TRAIN FILE IS:  /home/andrew/.local/share/instructlab/datasets/train_gen.jsonl
LINUX_TRAIN.PY: TEST FILE IS:  /home/andrew/.local/share/instructlab/datasets/test_gen.jsonl
LINUX_TRAIN.PY: Using device 'cuda:0'
  NVidia CUDA version: 12.1
  AMD ROCm HIP version: n/a
  cuda:0 is 'Tesla P4' (7.8 GiB of 7.9 GiB free, capability: 6.1)
  WARNING: You have less than 11811160064 GiB of free GPU memory on '{index}'. Training may fail, use slow shared host memory, or move some layers to CPU.
  Training does not use the local InstructLab serve. Consider stopping the server to free up about 5 GiB of GPU memory.
LINUX_TRAIN.PY: LOADING DATASETS
Unable to find '/home/andrew/.local/share/instructlab/datasets/train_gen.jsonl'
@apcameron added the bug label on Nov 16, 2024
@cdoern
Contributor

cdoern commented Nov 17, 2024

Hey @apcameron the simple pipeline should be considered legacy functionality as well, which is why it only works with the legacy dataset. If you want to use cuda, please try the accelerated pipeline or feel free to open an issue to track adding cuda support to the full pipeline!

@apcameron
Author

@cdoern My GPU only has 8 GB of VRAM, and as far as I know the simple pipeline is the only one with the --4-bit-quant option.
If I simply rename test_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl and train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl to test_gen.jsonl and train_gen.jsonl, then it works.
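The workaround is roughly the following (a sketch; copying rather than renaming keeps the originals, and symlinks would work just as well):

```shell
# Copy the newest generated dataset to the legacy filenames
# that the simple pipeline looks for.
cd ~/.local/share/instructlab/datasets
cp train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl train_gen.jsonl
cp test_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl test_gen.jsonl
```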

The accelerated pipeline gives me an out-of-memory error; as far as I know it does not support small GPUs, CUDA unified memory, or offloading extra layers to the CPU.

@cdoern
Contributor

cdoern commented Nov 17, 2024

@apcameron the accelerated pipeline supports LoRA and quantization, I believe. If you check the training section of your config, you can set lora_rank to anything but 0 (usually 2), set the lora_quantize_dtype, and enable DeepSpeed CPU offloading, roughly as in the sketch below.
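A sketch of what that training section of config.yaml might look like. The key names here are assumptions that mirror the CLI flags used later in this thread; verify them against your actual config (e.g. with ilab config show) before relying on them:

```yaml
# Sketch of the train section of config.yaml; exact key names and nesting
# may differ between InstructLab versions.
train:
  pipeline: accelerated
  lora_rank: 2                            # any non-zero value enables LoRA
  lora_quantize_dtype: nf4                # 4-bit quantization of the base weights
  deepspeed_cpu_offload_optimizer: true   # assumed key; mirrors --deepspeed-cpu-offload-optimizer
```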

@apcameron
Author

@cdoern When I try this
ilab model train --pipeline accelerated --device cuda --data-path ~/.local/share/instructlab/datasets/train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl --optimize-memory --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab --4-bit-quant --lora-alpha 32 --lora-dropout 0.1 --lora-rank 2 --lora-quantize-dtype nf4 --deepspeed-cpu-offload-optimizer true

I get the following error

[2024-11-17 23:03:45,906] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 2024-11-17 23:03:52,432 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-11-17 23:03:52,738 datasets:59: PyTorch version 2.4.0+cu121 available.
--- Logging error ---
Traceback (most recent call last):
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 198, in accelerated_train
    run_training(train_args=train_args, torch_args=torch_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 663, in run_training
    check_valid_train_args(train_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/utils.py", line 88, in check_valid_train_args
    raise ValueError(
ValueError: `accelerate_full_state_at_epoch` is not currently supported when training LoRA models.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/data/instructlab/env/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 201, in accelerated_train
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (ValueError('`accelerate_full_state_at_epoch` is not currently supported when training LoRA models.'),)
Accelerated Training failed with 1

@cdoern
Contributor

cdoern commented Nov 17, 2024

@apcameron so the error here is actually an incompatibility between our checkpointing system and LoRA. (As a side note, you will also need --distributed-backend deepspeed.)

@RobotSail I thought we fixed this in 0.21?

@apcameron can you run pip show instructlab-training? It's possible this fix landed in 0.6.1 rather than 0.6.0 of that library.

@apcameron
Author

@cdoern

pip show instructlab-training
Name: instructlab-training
Version: 0.6.1
Summary: Training Library
Home-page: 
Author: 
Author-email: InstructLab <[email protected]>
License: Apache-2.0
Location: /data/instructlab/env/lib/python3.11/site-packages
Requires: aiofiles, datasets, instructlab-dolomite, numba, numpy, packaging, peft, py-cpuinfo, pydantic, pyyaml, rich, torch, transformers, trl, wheel
Required-by: instructlab

@cdoern
Contributor

cdoern commented Nov 18, 2024

@apcameron

Can you use --fsdp-cpu-offload-optimizer and --distributed-backend fsdp instead? Looking at instructlab/training#295, it looks like this should work.

@apcameron
Author

apcameron commented Nov 18, 2024

@cdoern I get the same error

ilab model train --pipeline accelerated --device cuda --data-path ~/.local/share/instructlab/datasets/train_mistral-7b-instruct-v0.2.Q4_K_M_2024-11-17T21_10_22.jsonl --optimize-memory  --model-path ~/.cache/instructlab/models/instructlab/granite-7b-lab --4-bit-quant --lora-alpha 32 --lora-dropout 0.1 --lora-rank 2 --lora-quantize-dtype nf4 --fsdp-cpu-offload-optimizer true --gpus 1 --distributed-backend fsdp
[2024-11-18 14:19:38,902] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 2024-11-18 14:20:26,901 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-11-18 14:20:29,012 datasets:59: PyTorch version 2.4.0+cu121 available.
--- Logging error ---
Traceback (most recent call last):
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 198, in accelerated_train
    run_training(train_args=train_args, torch_args=torch_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 663, in run_training
    check_valid_train_args(train_args)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/training/utils.py", line 88, in check_valid_train_args
    raise ValueError(
ValueError: `accelerate_full_state_at_epoch` is not currently supported when training LoRA models.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/data/instructlab/env/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/data/instructlab/env/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 201, in accelerated_train
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (ValueError('`accelerate_full_state_at_epoch` is not currently supported when training LoRA models.'),)
Accelerated Training failed with 1

@knijesh

knijesh commented Dec 8, 2024

I am on 0.21.0 and I get the same error. I wanted to train a better version of the model with either the full or the accelerated pipeline. Full doesn't work on the GPU, and accelerated fails with a memory error on one L40 GPU (48 GB VRAM). What's the recommendation? @apcameron @cdoern

Epoch: 0, Step: 1, Rank: 0, loss = 1.78125
/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Epoch 0:   1%|          | 1/130 [00:11<23:40, 11.01s/it]{
    "epoch": 0,
    "step": 1,
    "rank": 0,
    "overall_throughput": 1.96022160644846,
    "lr": 0.0,
    "cuda_mem_allocated": 26.81872797012329,
    "cuda_malloc_retries": 1,
    "num_loss_counted_tokens": 26639,
    "batch_size": 22,
    "total_loss": 1.7778445136829462,
    "samples_seen": 22,
    "gradnorm": null,
    "total_samples": 2692,
    "timestamp": "2024-12-08T14:10:39.778697"
}
 total length: 28948 num samples 21 - rank: 0 num_loss_counted_tokens: 27394
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 967, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 642, in main
[rank0]:     train(
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 385, in train
[rank0]:     output = model(
[rank0]:              ^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
[rank0]:     output = self._fsdp_wrapped_module(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/utils.py", line 381, in reduce_sum_forward
[rank0]:     output = model.__original_forward__(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1214, in forward
[rank0]:     loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **loss_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
[rank0]:     loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
[rank0]:     loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/nn/functional.py", line 3104, in cross_entropy
[rank0]:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.43 GiB. GPU 0 has a total capacity of 44.52 GiB of which 1.47 GiB is free. Including non-PyTorch memory, this process has 43.05 GiB memory in use. Of the allocated memory 39.87 GiB is allocated by PyTorch, and 2.35 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Epoch 0:   1%|          | 1/130 [00:13<29:51, 13.89s/it]
E1208 14:10:43.465000 140500345648256 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2134572) of binary: /home/ibmuser/pyenv/bin/python3.11
Traceback (most recent call last):
  File "/home/ibmuser/pyenv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/main_ds.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-08_14:10:43
  host      : ce-sg-gpu-l40-vpc-24x120-ubuntu22
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2134572)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...
--- Logging error ---
Traceback (most recent call last):
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 198, in accelerated_train
    run_training(train_args=train_args, torch_args=torch_args)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/training/main_ds.py", line 831, in run_training
    raise RuntimeError(
RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/ibmuser/pyenv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/home/ibmuser/pyenv/lib/python3.11/site-packages/instructlab/model/accelerated_train.py", line 201, in accelerated_train
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (RuntimeError('Suffered a failure during distributed training. Please see the training logs for more context.'),)
Accelerated Training failed with 1
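One low-effort thing to try first, per the suggestion in the CUDA OOM message above (a sketch, not a verified fix for this setup): set PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation, then rerun the same training command.

```shell
# Suggested by the OOM message itself to reduce memory fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Then rerun the same training command as before, e.g.:
ilab model train --pipeline accelerated --device cuda --gpus 1 ...
```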

@ktam3 added the tech-debt label on Jan 22, 2025