load_tokenizer fails sometimes with multiprocessing #105

Open
gabrielhuang opened this issue Nov 19, 2024 · 3 comments
Comments

@gabrielhuang
Collaborator

    def load_tokenizer(self):
        if self.tokenizer is None:
            import transformers

            name = _MOCK_TOKENIZER if _MOCK_TOKENIZER else (self.tokenizer_name or self.model_name)
            self.tokenizer = transformers.AutoTokenizer.from_pretrained(name)

This fails with:

  File "/home/toolkit/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/toolkit/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/cat-mono-repo/llmd2-core/src/llmd2/tapeagents_tmp/ghreat/dev/run_user_simulator.py", line 91, in main
    raise exception
  File "/home/toolkit/code/cat-mono-repo/llmd2-core/src/llmd2/tapeagents_tmp/ghreat/dev/run_user_simulator.py", line 41, in run_user_simulator_agent
    user_simulator_agent_tape = user_simulator_agent.run(user_simulator_agent_tape).get_final_tape()
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/agent.py", line 60, in get_final_tape
    for event in self:
                 ^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/agent.py", line 364, in _run_implementation
    for step in current_subagent.run_iteration(tape):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/agent.py", line 344, in run_iteration
    for step in self.generate_steps(tape, llm_stream):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/agent.py", line 296, in generate_steps
    for step in node.generate_steps(self, tape, llm_stream):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/cat-mono-repo/llmd2-core/src/llmd2/tapeagents_tmp/ghreat/user_simulator_agent.py", line 193, in generate_steps
    user_utterance = llm_stream.get_text()
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 65, in get_text
    o = self.get_output()
        ^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 59, in get_output
    for event in self:
                 ^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 56, in __next__
    return next(self.generator)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 189, in _implementation
    toks = self.count_tokens(prompt.messages)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 447, in count_tokens
    self.load_tokenizer()
  File "/home/toolkit/code/TapeAgents/tapeagents/llms.py", line 323, in load_tokenizer
    self.tokenizer = transformers.AutoTokenizer.from_pretrained(name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'transformers' has no attribute 'AutoTokenizer'

This is because dynamic imports and multiprocessing don't play well together, due to pickling.

@gabrielhuang
Collaborator Author

gabrielhuang commented Nov 19, 2024

If we want to keep transformers optional, here is my workaround: pre-import transformers from the main script.

import tapeagents.llms
import transformers
tapeagents.llms.transformers = transformers
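
A note on why this workaround plausibly helps, as a minimal sketch. Judging by the load_tokenizer snippet above, `import transformers` sits inside the function, so it binds a local name there and the module attribute tapeagents.llms.transformers is not consulted by that function. The part that likely matters is that the main script imports transformers early, so later imports find it already in sys.modules. The example uses a hypothetical stand-in module, fake_heavy_lib, instead of transformers itself.

```python
import sys
import types

# Hypothetical stand-in for a heavy library. Registering it in sys.modules
# makes `import fake_heavy_lib` succeed without installing anything.
_fake = types.ModuleType("fake_heavy_lib")
_fake.VALUE = "from sys.modules"
sys.modules["fake_heavy_lib"] = _fake

# A module-level name, analogous to `tapeagents.llms.transformers = ...`.
fake_heavy_lib = types.SimpleNamespace(VALUE="from module global")


def load_value():
    # A function-level import binds a *local* name, shadowing the global
    # above; the module object itself comes from sys.modules.
    import fake_heavy_lib
    return fake_heavy_lib.VALUE


print(load_value())  # prints "from sys.modules"
```

Under that reading, the `import transformers` line in the main script would be the effective part of the workaround, and the attribute assignment would only matter if load_tokenizer read a module-level name.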

@gabrielhuang
Collaborator Author

Note: @ehsk and @NicolasAG have encountered a similar problem, so I think this should be fixed fairly quickly. If it is important that transformers not be a dependency of the whole project, we could move TrainableLLM to a separate file and switch to a static import there.
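
The static-import option could look roughly like the sketch below. It assumes a hypothetical separate module (called trainable_llm.py here). TrainableLLMSketch is an illustrative stand-in, not the real TrainableLLM, and the try/except guard is an added assumption for the case where transformers should stay optional.

```python
# trainable_llm.py (hypothetical file name)

# Static, module-level import: resolved once when this module is first
# imported, not on the first call to load_tokenizer.
try:
    import transformers
except ImportError:  # assumption: transformers stays an optional dependency
    transformers = None


class TrainableLLMSketch:
    """Illustrative stand-in for TrainableLLM; only tokenizer loading is shown."""

    def __init__(self, model_name, tokenizer_name=None):
        self.model_name = model_name
        self.tokenizer_name = tokenizer_name
        self.tokenizer = None

    def load_tokenizer(self):
        if transformers is None:
            raise ImportError("transformers is required to use TrainableLLMSketch")
        if self.tokenizer is None:
            name = self.tokenizer_name or self.model_name
            self.tokenizer = transformers.AutoTokenizer.from_pretrained(name)
        return self.tokenizer
```

With this layout, only code that imports the separate module would pay the transformers import cost.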

@ollmer
Collaborator

ollmer commented Nov 25, 2024

Thanks for the report, Gabriel! transformers is already in the main requirements.txt, so it's not really optional. It was put behind a conditional import because import transformers takes a crazy amount of time (3-5 seconds), which considerably increased the startup time of almost all of our scripts, since we use import tapeagents.llms almost everywhere.
We can try https://github.com/huggingface/tokenizers instead of the whole transformers lib to avoid the load time penalty. PRs are welcome!
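
As a rough idea of what the tokenizers route might look like: the sketch below loads a tokenizer with `tokenizers.Tokenizer.from_pretrained` and counts tokens in a plain string. The helper names load_fast_tokenizer and count_text_tokens are hypothetical, and it assumes the model repository ships a tokenizer.json file. The tokenizers library does not apply chat templates, so if count_tokens relies on chat templating (not shown in this thread), that part would need separate handling.

```python
def load_fast_tokenizer(name):
    """Load a tokenizer by model name using the standalone tokenizers library."""
    # Kept as a function-level import so this sketch can be read and imported
    # without the package installed; a real change might import at module level.
    from tokenizers import Tokenizer

    return Tokenizer.from_pretrained(name)


def count_text_tokens(tokenizer, text):
    """Count tokens in a plain string.

    Works with any object whose encode(text) returns something with an
    `ids` list, which is what tokenizers.Tokenizer.encode provides.
    """
    return len(tokenizer.encode(text).ids)
```

Usage would be along the lines of `count_text_tokens(load_fast_tokenizer(name), "some text")`, where name is whatever load_tokenizer currently passes to AutoTokenizer.from_pretrained. Whether this actually removes the startup penalty would need to be measured.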
