
[Feature]: Allow setting a max_tokens (max_completion_tokens in OpenAI API) for all requests. #11976

mhendrey opened this issue Jan 13, 2025 · 4 comments

@mhendrey

🚀 The feature, motivation and pitch

I'm running vLLM for production LLM hosting and would like to cap max_tokens (the total number of generated output tokens) for all requests. Currently, when using the OpenAI API server, default_max_tokens is calculated as the context window minus the prompt tokens. However, for models like Llama-3.1, which has a 128K context window, this is far too large.

Alternatives

One potential solution would be to allow max_new_tokens to be specified in the generation_config.json file, which is read at launch time. This could then become the server's max_tokens. Currently, only repetition_penalty, temperature, top_k, top_p, and min_p appear to be supported.

The code in openai/serving_completion would need to take the minimum of (max_model_len - prompt_tokens) and the generation config's max_tokens.

In addition, openai/protocol would need to ensure that a client's requested max_tokens cannot exceed the default_max_tokens value.
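
To make the intended behavior concrete, here is a rough sketch of the clamping I have in mind; resolve_max_tokens and its parameter names (e.g. generation_max_tokens) are illustrative, not existing vLLM identifiers:

```python
# Illustrative sketch of the proposed clamping; these are not existing vLLM
# functions or attribute names. `generation_max_tokens` stands in for a cap
# read from generation_config.json.
def resolve_max_tokens(
    max_model_len: int,
    prompt_token_ids: list[int],
    requested_max_tokens: int | None,
    generation_max_tokens: int | None,
) -> int:
    # Tokens left in the context window after the prompt.
    remaining = max_model_len - len(prompt_token_ids)

    # Server-side cap: the smaller of the remaining context and the configured
    # limit, if one was given.
    if generation_max_tokens is None:
        server_cap = remaining
    else:
        server_cap = min(remaining, generation_max_tokens)

    # The client's request may not exceed the server-side cap.
    if requested_max_tokens is None:
        return server_cap
    return min(requested_max_tokens, server_cap)
```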

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337 DarkLight1337 added the good first issue Good for newcomers label Jan 13, 2025
@mhendrey
Author

@DarkLight1337 thanks for the encouragement. I'll take a stab at making the PR.

@mhendrey
Author

I've taken a first stab at this, if someone wants to take a look and give some feedback.

  • generation_config.json can now have max_new_tokens specified
    • Hugging Face's max_new_tokens is equivalent to vLLM's max_tokens. Since generation_config.json comes from Hugging Face, I elected to keep their naming convention in the file but switch to max_tokens when reading it in (see the sketch below).
  • I treat the value provided in generation_config.json differently from the others. The others act as defaults that user requests can override; max_new_tokens is taken to be the maximum that the server is willing to handle. If a user submits a request with max_tokens > max_new_tokens, their max_tokens is silently lowered to the max_new_tokens value.
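
As a rough illustration of the rename on load, assuming nothing about the real loading code beyond what is described above (load_generation_defaults is a made-up name, not the actual vllm/config.py code):

```python
import json

# Illustrative only: the real loading lives in vllm/config.py and is structured
# differently. This just shows the max_new_tokens -> max_tokens rename.
def load_generation_defaults(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)

    allowed = {"repetition_penalty", "temperature", "top_k", "top_p", "min_p"}
    defaults = {k: v for k, v in config.items() if k in allowed}

    # Hugging Face's generation_config.json uses max_new_tokens; vLLM's
    # SamplingParams uses max_tokens, so rename on the way in.
    if "max_new_tokens" in config:
        defaults["max_tokens"] = config["max_new_tokens"]

    return defaults
```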

High level description of the changes

  • vllm/config.py - Added max_new_tokens to the list of available params and renamed it to max_tokens when it is present in the generation_config.json file.
  • vllm/entrypoints/openai/protocol.py - Renamed default_max_tokens to server_max_tokens. The request's max_tokens is set to the minimum of server_max_tokens and the user's requested max_tokens (see the sketch after this list).
  • vllm/entrypoints/openai/serving_chat.py - Renamed default_max_tokens to server_max_tokens. The value is set to the minimum of max_model_len - len(prompt_token_ids) and max_new_tokens from generation_config.json.
  • vllm/entrypoints/openai/serving_completions.py - Renamed default_max_tokens to server_max_tokens. The value is set to the minimum of max_model_len - len(prompt_token_ids) and max_new_tokens from generation_config.json.
  • vllm/entrypoints/llm.py - In _validate_and_add_requests(), SamplingParams.max_tokens is set to the minimum of the user's requested value and max_new_tokens from generation_config.json.
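
For the protocol.py piece, something along these lines is what I mean; build_sampling_params and its parameters are illustrative rather than the actual method signature:

```python
from vllm import SamplingParams

# Illustrative only: mirrors the behaviour described above, where the request's
# max_tokens is reconciled with the server-side cap before building
# SamplingParams. The function name and signature are made up for the example.
def build_sampling_params(
    requested_max_tokens: int | None,
    server_max_tokens: int,
    temperature: float = 1.0,
) -> SamplingParams:
    if requested_max_tokens is None:
        # No explicit request: fall back to the server's cap.
        max_tokens = server_max_tokens
    else:
        # Silently lower an over-large request to the server's cap.
        max_tokens = min(requested_max_tokens, server_max_tokens)
    return SamplingParams(max_tokens=max_tokens, temperature=temperature)
```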

Another possibility, if we don't like generation_config.json being more than just default values, would be to add a server_max_tokens to the engine itself, either as a new engine argument or perhaps still read from generation_config.json but stored as a new engine.server_max_tokens that would be checked against user requests.

Feedback welcome

@DarkLight1337
Member

Thanks for your efforts! This looks reasonable; can you open a PR? I don't think it's necessary to update vllm/entrypoints/llm.py since the checks should already be done in vllm/entrypoints/openai.

@mhendrey
Author

I've made the PR. I've left the vllm/entrypoints/llm.py changes in for now, until a final determination can be made. Many thanks for considering this PR.
