[Frontend] Set server's maximum number of generated tokens using generation_config.json #12242
base: main
Conversation
`ModelConfig.get_diff_sampling_params()` now allows reading `"max_new_tokens"` if it is specified in the generation_config.json file. This follows Hugging Face's naming convention for the parameter that specifies the maximum number of generated tokens. The key is renamed to `"max_tokens"` to follow the naming convention vLLM uses for the same functionality.
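A minimal sketch of that renaming step (the helper name and the set of recognized keys here are illustrative, not the actual vLLM implementation):

```python
# Illustrative sketch only: a simplified stand-in for the key rename described
# above. `generation_config` stands for the dict loaded from generation_config.json.
from typing import Any


def extract_sampling_params(generation_config: dict[str, Any]) -> dict[str, Any]:
    """Pull sampling-related overrides out of a generation_config.json dict."""
    params = {
        k: v
        for k, v in generation_config.items()
        if k in ("temperature", "top_p", "top_k", "max_new_tokens")
    }
    # Hugging Face calls the limit "max_new_tokens"; vLLM's sampling parameters
    # use "max_tokens", so rename the key before handing it to the server.
    if "max_new_tokens" in params:
        params["max_tokens"] = params.pop("max_new_tokens")
    return params


# Example: {"temperature": 0.6, "max_new_tokens": 512}
#       -> {"temperature": 0.6, "max_tokens": 512}
```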
Previously the default_max_tokens was (max_model_len - prompt_tokens); now server_max_tokens = min(max_model_len - prompt_tokens, max_tokens if set in generation_config.json).
server_max_tokens is the minimum of the architectural limit, which was the old default_max_tokens, and the max_new_tokens set in generation_config.json.
These are ints, so that wasn't good. I could have gone with 2**64 or something, but didn't like the idea of some hardcoded value, so I changed the logic just a touch.
Also added setting server_max_tokens to the minimum of (context window - prompt tokens) and the max_new_tokens value set in generation_config.json.
server_max_tokens is set either by the architectural limit (context_window - prompt_tokens) or by the max_new_tokens value set in generation_config.json, whichever is smaller.
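As a rough illustration of the logic change discussed above (function and argument names are hypothetical, assuming an unset limit arrives as None rather than a hardcoded sentinel):

```python
# Hypothetical sketch of the server-side cap, assuming max_new_tokens may be
# absent (None) in generation_config.json; this is not the exact PR diff.
from typing import Optional


def compute_server_max_tokens(
    max_model_len: int,
    prompt_tokens: int,
    config_max_new_tokens: Optional[int],
) -> int:
    """Cap generated tokens by the context window and, if set, max_new_tokens."""
    # Architectural limit: what still fits in the context window.
    default_max_tokens = max_model_len - prompt_tokens
    # Rather than treating "unset" as a huge sentinel like 2**64, only include
    # the generation_config limit in the min() when it is actually present.
    limits = [default_max_tokens]
    if config_max_new_tokens is not None:
        limits.append(config_max_new_tokens)
    return min(limits)


# e.g. max_model_len=131072, prompt_tokens=1000, config limit 512 -> 512
# e.g. same prompt but no limit in generation_config.json       -> 130072
```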
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This PR adds the ability to specify the `max_new_tokens` entry in the generation_config.json file. Setting this parameter acts as the server's maximum number of generated tokens for any given request.

If a user does not specify `max_tokens` in their request, then the minimum of `max_new_tokens` and (`max_model_len` - prompt_tokens) will be used. Current behavior just uses `max_model_len` - prompt_tokens as the default. For large context window models, e.g., Llama-3.1, this is 128K.

If a user does specify `max_tokens`, then the minimum of `max_tokens`, `max_new_tokens`, and `max_model_len` - prompt_tokens will be used. Current behavior allows the user to specify a `max_tokens` larger than is physically allowed by the context window, though doing so throws an error. With this change, the server would quietly override the user's requested `max_tokens` instead.

Still need to add some testing, but would appreciate some pointers to where the tests should be added. @DarkLight1337 suggested it may not be necessary to update `vllm/entrypoints/llm.py`, but I've left that in for now until a final determination can be made.

FIX #11976