Commit

Merge branch 'instructlab:main' into feature/add-try-catch-import-to-deepspeed
Harthi7 authored Nov 8, 2024
2 parents 9c03b6b + 45162d5 commit 37c4fa6
Showing 10 changed files with 423 additions and 82 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/pypi.yaml
```diff
@@ -77,7 +77,7 @@ jobs:
           path: dist

       - name: "Upload to Test PyPI"
-        uses: pypa/gh-action-pypi-publish@61da13deb5f5124fb1536194f82ed3d9bbc7e8f3 # v1.12.0
+        uses: pypa/gh-action-pypi-publish@15c56dba361d8335944d31a2ecd17d700fc7bcbc # v1.12.2
         with:
           repository-url: https://test.pypi.org/legacy/
@@ -129,4 +129,4 @@ jobs:
           rm ./dist/*.sigstore.json
       - name: "Upload to PyPI"
-        uses: pypa/gh-action-pypi-publish@61da13deb5f5124fb1536194f82ed3d9bbc7e8f3 # v1.12.0
+        uses: pypa/gh-action-pypi-publish@15c56dba361d8335944d31a2ecd17d700fc7bcbc # v1.12.2
```
250 changes: 250 additions & 0 deletions CHANGELOG.md
# Changelog

## v0.5.5

### v0.5.5 Features

* e2e: replace old small job with new medium job

### v0.5.5 Fixes

* fix: incorrect label for AWS medium runner
* chore: add exit code & tox fix

### v0.5.5 Infrastructure

* ci: grant HF_TOKEN access to the medium-size E2E CI job

## v0.5.4

### v0.5.4 Features

* Add rocm extra to pyproject.toml

## v0.5.3

### v0.5.3 Fixes

* fix: Add explicit flash_attn requirement for ROCm

## v0.5.2 - Fix Pretraining Masking

### v0.5.2 Fixes

* fix: improve linting and automation
* Fix pretrain token list->int for masking

## v0.5.1

### v0.5.1 Fixes

* fix: updates sorting logic to correctly compare numbers
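
The sorting fix addresses the classic lexicographic-versus-numeric pitfall when ordering names that embed numbers (e.g. checkpoint directories). A hedged sketch of the idea (the names and layout are illustrative, not this repo's actual code):

```python
import re

def numeric_sort_key(name: str):
    """Split a string into text and integer chunks so embedded
    numbers compare numerically instead of character by character."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

ckpts = ["samples_100", "samples_20", "samples_3"]
# Plain string sort compares digits one at a time: "1" < "2" < "3"
assert sorted(ckpts) == ["samples_100", "samples_20", "samples_3"]
# A numeric-aware key restores the intended order
assert sorted(ckpts, key=numeric_sort_key) == ["samples_3", "samples_20", "samples_100"]
```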

## v0.5.0 - FSDP and Full-State Checkpoint Resuming

### v0.5.0 Features

* feat: add e2e test for instructlab CI
* feat: add mergify
* Adding FSDP Support to Training Library by @aldopareja @Maxusmusti @RobotSail
* adds Accelerate full-state (opt, lr_sched, params)
* changes StreamablePopen to return a process and implement listening

### v0.5.0 Fixes

* Fix lint error to make CI happy
* Fix typos
* Ap/fix multipack for non granite models
* Fix generic chat template saved to tokenizer for generation
* Fix linting error and missing quote

### v0.5.0 Infrastructure

* Add license identifiers
* ci: update runner labels to uniquely identify instance sizes
* ci: minor cleanup of E2E job
* Fixing e2e to use relative path for working-directory
* switch -T to -a
* github: add stale bot to training repo
* fix: markdown lint error and mergify bug
* Bump actions/checkout from 4.1.7 to 4.2.0
* Bump step-security/harden-runner from 2.8.1 to 2.9.1
* Bump pypa/gh-action-pypi-publish from 1.9.0 to 1.10.2
* Bump actions/setup-python from 5.1.0 to 5.2.0
* Bump rhysd/actionlint from 1.7.1 to 1.7.2
* Bump hynek/build-and-inspect-python-package from 2.6.0 to 2.9.0
* Bump DavidAnson/markdownlint-cli2-action from 16.0.0 to 17.0.0
* ci: fix lint action
* ci: add AWS tags to show github ref and PR num for all jobs

## v0.5.0 Alpha 0 - The FSDP Release Pre-release

### v0.5.0 Alpha Description

The FSDP Release introduces FSDP support in addition to the existing DeepSpeed support through the accelerate library.
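
Since both backends are driven through the accelerate library, selecting FSDP is largely a matter of the accelerate launch configuration. A minimal config sketch (the field names follow accelerate's config-file format; the specific values here are illustrative assumptions, not this repo's shipped defaults):

```yaml
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: FULL_STATE_DICT
```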

### v0.5.0 Alpha Features

* feat: add e2e test for instructlab CI
* feat: add mergify
* Adding FSDP Support to Training Library by @aldopareja @Maxusmusti @RobotSail

### v0.5.0 Alpha Fixes

* Fix lint error to make CI happy
* Fix typos
* Ap/fix multipack for non granite models
* Fix linting error and missing quote

### v0.5.0 Alpha Infrastructure

* Add license identifiers
* ci: update runner labels to uniquely identify instance sizes
* ci: minor cleanup of E2E job
* Fixing e2e to use relative path for working-directory
* Bump step-security/harden-runner from 2.8.1 to 2.9.1
* Bump pypa/gh-action-pypi-publish from 1.9.0 to 1.10.2
* Bump actions/setup-python from 5.1.0 to 5.2.0
* Bump rhysd/actionlint from 1.7.1 to 1.7.2
* Bump hynek/build-and-inspect-python-package from 2.6.0 to 2.9.0
* Bump DavidAnson/markdownlint-cli2-action from 16.0.0 to 17.0.0
* ci: fix lint action
* ci: add AWS tags to show github ref and PR num for all jobs

## v0.4.2

### v0.4.2 Features

* Provide safeguards during training

## v0.4.1

### v0.4.1 Changes

* makes saving every save_samples an optional feature

## v0.4.0

### v0.4.0 Features

* Adds a flag to save checkpoints at the end of an epoch

### v0.4.0 Changes

* Change success message at end of training

## v0.3.2

### v0.3.2 Features

* Accept tuples for lora.target_modules

### v0.3.2 Documentation

* patch some hyperparameter arg descriptions in README

## v0.3.1

### v0.3.1 Dependencies

* Update requirements to have bitsandbytes min and dolomite min

## v0.3.0

### v0.3.0 Features

* Updating token masking to support pretraining w/ masked special tokens
* Adding weight merging for LoRA/QLoRA ckpts
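
Weight merging folds the trained low-rank adapters back into the base weights, so the resulting checkpoint can be served without adapter support. The arithmetic, sketched with NumPy (this is the standard LoRA formulation, not code from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 6, 2, 4              # hidden dim, LoRA rank, LoRA alpha

W = rng.standard_normal((d, d))    # frozen base weight
A = rng.standard_normal((r, d))    # LoRA down-projection
B = np.zeros((d, r))               # LoRA up-projection (zero-initialized)
B[:, 0] = 1.0                      # pretend training moved it off zero

# Merged weight: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * B @ A

x = rng.standard_normal(d)
# Serving the merged weight matches running base + adapter side by side
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```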

### v0.3.0 Fixes

* remove dead code
* fix: changes the check to check against both the enum option and enum value

## v0.2.0

### v0.2.0 Features

* Fix ckpt save to include architecture for inference runtime consumption
* Logging updates

### v0.2.0 Performance

* Reducing deepspeed timeout to 10mins

## v0.1.0

### v0.1.0 Features

* Flash Attention Disable Toggle (Take 2)

### v0.1.0 Performance

* Reduce Unnecessary Multiprocessing

### v0.1.0 Fixes

* 🐛: fix optimizer selection logic so that FusedAdam is never loaded when CPU offloading is enabled
* Add wheel to requirements
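
The optimizer fix guards against pairing FusedAdam, a GPU-only kernel, with DeepSpeed CPU offloading. A hedged sketch of the selection logic (the function and return names are illustrative; the library's actual branch may differ):

```python
def select_optimizer(cpu_offload: bool, cuda_available: bool) -> str:
    """Pick an optimizer implementation compatible with the setup.

    FusedAdam only runs on GPU, so it must never be chosen when
    parameters are offloaded to CPU; DeepSpeedCPUAdam covers that case.
    """
    if cpu_offload:
        return "DeepSpeedCPUAdam"
    if cuda_available:
        return "FusedAdam"
    return "AdamW"

assert select_optimizer(cpu_offload=True, cuda_available=True) == "DeepSpeedCPUAdam"
assert select_optimizer(cpu_offload=False, cuda_available=True) == "FusedAdam"
```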

## v0.0.5.1

### v0.0.5.1 Fixes

This release includes PR [#121](https://github.com/instructlab/training/pull/121) to work around an issue where our way of lazily importing the `run_training` function was flagged as an error by pylint.
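
The lazy-import pattern in question defers a heavy import until the function is actually called, keeping `import instructlab.training` cheap. A minimal sketch of the idea (illustrative only, not the repo's exact code; the "heavy" module is a stand-in):

```python
def run_training(*args, **kwargs):
    """Import the real implementation only when training is invoked."""
    # Deferred import: pylint can flag patterns like this
    # (e.g. import-outside-toplevel), which is the kind of warning
    # PR #121 had to address.
    from importlib import import_module

    impl = import_module("math")  # stand-in for a heavy dependency
    return impl.sqrt(args[0])

assert run_training(9) == 3.0
```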

## v0.0.5

Minor bugfixes and updates.

## v0.0.4

Minor bugfixes and updates.

## v0.0.3

Minor bugfixes and updates.

## v0.0.2

### Features

This release introduces the training library as a package under the `instructlab` package namespace.

To install it:

```bash
pip install instructlab-training
```

And to install it with flash-attn and other CUDA-dependent packages, you can use

```bash
pip install instructlab-training[cuda]
```

Here's how to use it:

```python
# run_training lives at the top of the package; the argument classes are in config
from instructlab.training import run_training
from instructlab.training.config import TorchrunArgs, TrainingArgs

torchrun_args = TorchrunArgs(
    nproc_per_node=1,  # 1 GPU
    nnodes=1,  # only 1 overall machine in the system
    node_rank=0,  # rank of the current machine
    rdzv_id=123,  # what ID other nodes will join on
    rdzv_endpoint="0.0.0.0:12345",  # address where other nodes will join
)

training_args = TrainingArgs(
    # specify training args here
)

run_training(torch_args=torchrun_args, train_args=training_args)
```

## v0.0.1

### v0.0.1 Features

Initial release with same features as v0.0.2.
42 changes: 28 additions & 14 deletions src/instructlab/training/chat_templates/ibm_generic_tmpl.py
```diff
@@ -1,30 +1,44 @@
 # SPDX-License-Identifier: Apache-2.0

 # First Party
-from instructlab.training.tokenizer_utils import SpecialTokens, TokenInfo
+from instructlab.training.chat_templates.utils import SpecialTokens, TokenInfo

 SPECIAL_TOKENS = SpecialTokens(
-    system=TokenInfo("<|system|>", add_to_tokenizer=True),
-    user=TokenInfo("<|user|>", add_to_tokenizer=True),
-    assistant=TokenInfo("<|assistant|>", add_to_tokenizer=True),
-    eos=TokenInfo("<|endoftext|>", add_to_tokenizer=True),
-    pad=TokenInfo("<|pad|>", add_to_tokenizer=True),
-    bos=TokenInfo("<|begginingoftext|>", add_to_tokenizer=True),
+    start_role=TokenInfo("<|start_of_role|>", add_to_tokenizer=True),
+    end_role=TokenInfo("<|end_of_role|>", add_to_tokenizer=True),
+    tool=TokenInfo("<|tool_call|>", add_to_tokenizer=True),
+    eos=TokenInfo("<|end_of_text|>", add_to_tokenizer=True),
+    bos=TokenInfo("<|end_of_text|>", add_to_tokenizer=True),
+    pad=TokenInfo("<|end_of_text|>", add_to_tokenizer=True),
 )

 CHAT_TEMPLATE = (
+    "{%- if tools %}"
+    "{{ '<|start_of_role|>available_tools<|end_of_role|>\n' }}"
+    "{% for tool in tools %}"
+    "{{ tool | tojson(indent=4) }}"
+    "{% if not loop.last %}"
+    "{{- '\n\n' }}"
+    "{% endif %}"
+    "{% endfor %}"
+    "{{ '<|end_of_text|>\n' }}"
+    "{% endif %}"
     "{% for message in messages %}"
-    "{% if message['role'] == 'pretraining' %}"
-    "{{'<|pretrain|>' + message['content'] + '<|endoftext|>' + '<|/pretrain|>' }}"
-    "{% elif message['role'] == 'system' %}"
-    "{{'<|system|>'+ '\n' + message['content'] + '\n'}}"
+    "{% if message['role'] == 'system' %}"
+    "{{ '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}"
+    "{% elif message['role'] == 'pretraining' %}"
+    "{{ '<|pretrain|>' + message['content'] + '<|end_of_text|>' + '<|/pretrain|>'}}"
     "{% elif message['role'] == 'user' %}"
-    "{{'<|user|>' + '\n' + message['content'] + '\n'}}"
+    "{{ '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}"
     "{% elif message['role'] == 'assistant' %}"
-    "{{'<|assistant|>' + '\n' + message['content'] + '<|endoftext|>' + ('' if loop.last else '\n')}}"
+    "{{ '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}"
+    "{% elif message['role'] == 'assistant_tool_call' %}"
+    "{{ '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}"
+    "{% elif message['role'] == 'tool_response' %}"
+    "{{ '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}"
     "{% endif %}"
     "{% if loop.last and add_generation_prompt %}"
-    "{{ '<|assistant|>' + '\n' }}"
+    "{{ '<|start_of_role|>assistant<|end_of_role|>' }}"
     "{% endif %}"
     "{% endfor %}"
 )
```
30 changes: 30 additions & 0 deletions src/instructlab/training/chat_templates/ibm_legacy_tmpl.py
```python
# SPDX-License-Identifier: Apache-2.0

# First Party
from instructlab.training.chat_templates.utils import SpecialTokens, TokenInfo

SPECIAL_TOKENS = SpecialTokens(
    system=TokenInfo("<|system|>", add_to_tokenizer=True),
    user=TokenInfo("<|user|>", add_to_tokenizer=True),
    assistant=TokenInfo("<|assistant|>", add_to_tokenizer=True),
    eos=TokenInfo("<|endoftext|>", add_to_tokenizer=True),
    pad=TokenInfo("<|pad|>", add_to_tokenizer=True),
    bos=TokenInfo("<|begginingoftext|>", add_to_tokenizer=True),
)

CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'pretraining' %}"
    "{{'<|pretrain|>' + message['content'] + '<|endoftext|>' + '<|/pretrain|>' }}"
    "{% elif message['role'] == 'system' %}"
    "{{'<|system|>'+ '\n' + message['content'] + '\n'}}"
    "{% elif message['role'] == 'user' %}"
    "{{'<|user|>' + '\n' + message['content'] + '\n'}}"
    "{% elif message['role'] == 'assistant' %}"
    "{{'<|assistant|>' + '\n' + message['content'] + '<|endoftext|>' + ('' if loop.last else '\n')}}"
    "{% endif %}"
    "{% if loop.last and add_generation_prompt %}"
    "{{ '<|assistant|>' + '\n' }}"
    "{% endif %}"
    "{% endfor %}"
)
```
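
To see what the legacy template produces, its Jinja branching can be traced by hand. A small Python sketch mirroring the same logic (a hand-translation for illustration only; the library actually renders the template through the tokenizer's Jinja engine):

```python
def render(messages, add_generation_prompt=False):
    """Hand-translation of the legacy Jinja chat template above."""
    out = []
    last = len(messages) - 1
    for i, m in enumerate(messages):
        role, content = m["role"], m["content"]
        if role == "pretraining":
            out.append("<|pretrain|>" + content + "<|endoftext|>" + "<|/pretrain|>")
        elif role == "system":
            out.append("<|system|>\n" + content + "\n")
        elif role == "user":
            out.append("<|user|>\n" + content + "\n")
        elif role == "assistant":
            out.append("<|assistant|>\n" + content + "<|endoftext|>"
                       + ("" if i == last else "\n"))
        if i == last and add_generation_prompt:
            out.append("<|assistant|>\n")
    return "".join(out)

msgs = [{"role": "user", "content": "Hi"}]
assert render(msgs, add_generation_prompt=True) == "<|user|>\nHi\n<|assistant|>\n"
```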
2 changes: 1 addition & 1 deletion src/instructlab/training/chat_templates/mistral_tmpl.py
```diff
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0

 # First Party
-from instructlab.training.tokenizer_utils import SpecialTokens, TokenInfo
+from instructlab.training.chat_templates.utils import SpecialTokens, TokenInfo

 SPECIAL_TOKENS = SpecialTokens(
     bos=TokenInfo("<s>", add_to_tokenizer=True),
```
29 changes: 29 additions & 0 deletions src/instructlab/training/chat_templates/utils.py
```python
# Standard
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenInfo:
    token: str
    add_to_tokenizer: bool = False


@dataclass
class SpecialTokens:
    system: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    user: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    assistant: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    eos: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    pad: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    bos: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    start_role: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    end_role: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    tool: TokenInfo = field(default_factory=lambda: TokenInfo(""))

    def get_tokens_to_add(self) -> List[str]:
        return [
            token_info.token
            for token_info in self.__dict__.values()
            if token_info.add_to_tokenizer and token_info.token
        ]
```