
[Feature]: Allow setting a max_tokens (max_completion_tokens in OpenAI API) for all requests. #11976

mhendrey opened this issue Jan 13, 2025 · 4 comments

@mhendrey

🚀 The feature, motivation and pitch

I'm running vLLM for production LLM hosting and would like to cap max_tokens (the total number of generated output tokens) for all requests. Currently, when using the OpenAI API server, default_max_tokens is calculated as the context window minus the prompt tokens. However, for models like Llama-3.1, which has a 128K context window, this is far too large.

Alternatives

One potential solution would be to allow max_new_tokens to be specified in the generation_config.json file, which is read at launch time. This could then become the server's max_tokens. Currently, only repetition_penalty, temperature, top_k, top_p, and min_p appear to be supported.

The code in openai/serving_completion would need to take the minimum of (max_model_len - prompt_tokens) and the generation config's max_tokens.

In addition, openai/protocol would need to ensure that a client's requested max_tokens cannot exceed the default_max_tokens value.
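
To make the intended behavior concrete, here is a rough sketch of the clamping I have in mind; resolve_max_tokens and its parameter names (e.g. generation_max_tokens) are illustrative, not existing vLLM identifiers:

```python
# Illustrative sketch of the proposed clamping; these are not existing vLLM
# functions or attribute names. `generation_max_tokens` stands in for a cap
# read from generation_config.json.
def resolve_max_tokens(
    max_model_len: int,
    prompt_token_ids: list[int],
    requested_max_tokens: int | None,
    generation_max_tokens: int | None,
) -> int:
    # Tokens left in the context window after the prompt.
    remaining = max_model_len - len(prompt_token_ids)

    # Server-side cap: the smaller of the remaining context and the configured
    # limit, if one was given.
    if generation_max_tokens is None:
        server_cap = remaining
    else:
        server_cap = min(remaining, generation_max_tokens)

    # The client's request may not exceed the server-side cap.
    if requested_max_tokens is None:
        return server_cap
    return min(requested_max_tokens, server_cap)
```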

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337 DarkLight1337 added the good first issue Good for newcomers label Jan 13, 2025
@mhendrey
Author

@DarkLight1337 thanks for the encouragement. I'll take a stab at making the PR.

@mhendrey
Author

I've taken a first stab at this, if someone wants to take a look and give some feedback.

  • generation_config.json can now have max_new_tokens specified
    • Hugging Face's max_new_tokens is equivalent to vLLM's max_tokens. Since generation_config.json comes from Hugging Face, I elected to keep their naming convention in the file but switch to max_tokens when reading it in (see the sketch below).
  • I treat the value provided in generation_config.json differently from the others. The others act as defaults that user requests can override; max_new_tokens is taken to be the maximum that the server is willing to handle. If a user submits a request with max_tokens > max_new_tokens, their max_tokens is silently lowered to the max_new_tokens value.
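
As a rough illustration of the rename on load, assuming nothing about the real loading code beyond what is described above (load_generation_defaults is a made-up name, not the actual vllm/config.py code):

```python
import json

# Illustrative only: the real loading lives in vllm/config.py and is structured
# differently. This just shows the max_new_tokens -> max_tokens rename.
def load_generation_defaults(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)

    allowed = {"repetition_penalty", "temperature", "top_k", "top_p", "min_p"}
    defaults = {k: v for k, v in config.items() if k in allowed}

    # Hugging Face's generation_config.json uses max_new_tokens; vLLM's
    # SamplingParams uses max_tokens, so rename on the way in.
    if "max_new_tokens" in config:
        defaults["max_tokens"] = config["max_new_tokens"]

    return defaults
```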

High level description of the changes

  • vllm/config.py - Added max_new_tokens to the list of available params and renamed it to max_tokens when it is present in the generation_config.json file.
  • vllm/entrypoints/openai/protocol.py - Renamed default_max_tokens to server_max_tokens. The request's max_tokens is set to the minimum of server_max_tokens and the user's requested max_tokens (see the sketch after this list).
  • vllm/entrypoints/openai/serving_chat.py - Renamed default_max_tokens to server_max_tokens. The value is set to the minimum of max_model_len - len(prompt_token_ids) and max_new_tokens from generation_config.json.
  • vllm/entrypoints/openai/serving_completions.py - Renamed default_max_tokens to server_max_tokens. The value is set to the minimum of max_model_len - len(prompt_token_ids) and max_new_tokens from generation_config.json.
  • vllm/entrypoints/llm.py - In _validate_and_add_requests(), SamplingParams.max_tokens is set to the minimum of the user's requested value and max_new_tokens from generation_config.json.
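
For the protocol.py piece, something along these lines is what I mean; build_sampling_params and its parameters are illustrative rather than the actual method signature:

```python
from vllm import SamplingParams

# Illustrative only: mirrors the behaviour described above, where the request's
# max_tokens is reconciled with the server-side cap before building
# SamplingParams. The function name and signature are made up for the example.
def build_sampling_params(
    requested_max_tokens: int | None,
    server_max_tokens: int,
    temperature: float = 1.0,
) -> SamplingParams:
    if requested_max_tokens is None:
        # No explicit request: fall back to the server's cap.
        max_tokens = server_max_tokens
    else:
        # Silently lower an over-large request to the server's cap.
        max_tokens = min(requested_max_tokens, server_max_tokens)
    return SamplingParams(max_tokens=max_tokens, temperature=temperature)
```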

Another possibility, if we don't like generation_config.json being more than just default values, would be to add a server_max_tokens to the engine itself, either as a new engine argument or perhaps still read from generation_config.json but stored as a new engine.server_max_tokens that would be checked against user requests.

Feedback welcome

@DarkLight1337
Member

Thanks for your efforts! This looks reasonable; can you open a PR? I don't think it's necessary to update vllm/entrypoints/llm.py since the checks should already be done in vllm/entrypoints/openai.

@mhendrey
Author

I've made the PR. I've left the vllm/entrypoints/llm.py changes in for now, until a final determination can be made. Many thanks for considering this PR.
