[Feature]: Allow setting a max_tokens (max_completion_tokens in OpenAI API) for all requests. #11976
Comments
@DarkLight1337 thanks for the encouragement. I'll take a stab at making the PR.
I've taken a first stab at this, if someone wants to take a look and give some feedback.
High-level description of the changes: [...]
Another possibility, if we don't like the generation_config.json being more than just default values, would be to add a [...]. Feedback welcome.
Thanks for your efforts! This looks reasonable, can you open a PR? I think it's not necessary to update [...]
I've made the PR. I've left in the [...]
🚀 The feature, motivation and pitch
I'm running vLLM for production LLM hosting and would like to cap max_tokens (the total number of generated output tokens) for all requests. Currently, when using the OpenAI API server, default_max_tokens is calculated as the context window minus the prompt tokens. However, for models like Llama-3.1, which has a context window of 128K, this is far too large.
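For illustration, here is a rough sketch of how large the current default ends up being (not vLLM's actual code; the prompt length is made up):

```python
# Rough sketch of the current behavior (not vLLM's actual code):
# the default max_tokens is simply whatever context remains after the prompt.
max_model_len = 131072     # Llama-3.1 context window (~128K tokens)
prompt_tokens = 1500       # example prompt length

default_max_tokens = max_model_len - prompt_tokens
print(default_max_tokens)  # 129572 -- far larger than a sensible per-request cap
```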
Alternatives
One potential solution would be to allow max_new_tokens to be specified in the generation_config.json file, which is read at launch time. This could then become the server's default max_tokens. Currently, only repetition_penalty, temperature, top_k, top_p, and min_p seem to be supported from that file.

The code in openai/serving_completion would need to take the minimum of max_model_len - prompt_tokens and the generation_config max tokens, as sketched below.
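A minimal sketch of the idea, assuming a new max_new_tokens field in generation_config.json (the field name and surrounding code are illustrative, not vLLM's actual implementation):

```python
# Sketch only: how a max_new_tokens entry in generation_config.json could
# cap the server-side default. Names are illustrative, not vLLM internals.
import json

# e.g. generation_config.json shipped with the model, with a new field:
# {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 2048}
with open("generation_config.json") as f:
    generation_config = json.load(f)

max_model_len = 131072
prompt_tokens = 1500

generation_max_tokens = generation_config.get("max_new_tokens")
remaining = max_model_len - prompt_tokens

# The server default becomes the smaller of the two limits.
default_max_tokens = (
    min(remaining, generation_max_tokens)
    if generation_max_tokens is not None
    else remaining
)
print(default_max_tokens)  # 2048 with the example config above
```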
In addition, the openai/protocol code would need to ensure that a client's requested max_tokens cannot exceed the default_max_tokens value; a sketch of that check follows.
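Here resolve_max_tokens is a hypothetical helper, not an existing vLLM function, shown only to make the intended behavior concrete:

```python
# Sketch only: clamp (or alternatively reject) a client-supplied max_tokens
# that exceeds the server-side cap.
def resolve_max_tokens(requested: int | None, default_max_tokens: int) -> int:
    """Return the max_tokens to use for this request."""
    if requested is None:
        return default_max_tokens
    if requested > default_max_tokens:
        # Could also raise a 400-style validation error instead of clamping.
        return default_max_tokens
    return requested

print(resolve_max_tokens(None, 2048))    # 2048
print(resolve_max_tokens(100000, 2048))  # 2048
print(resolve_max_tokens(512, 2048))     # 512
```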
Additional context
No response