[Frontend] Set server's maximum number of generated tokens using generation_config.json #12242
base: main
Conversation
`ModelConfig.get_diff_sampling_params()` now allows reading `"max_new_tokens"` if it is specified in the generation_config.json file. This follows Hugging Face's naming convention for the parameter that specifies the maximum number of generated tokens. The key is renamed to `"max_tokens"` to follow the naming convention vLLM uses for the same functionality.
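A minimal sketch of that renaming step (the helper name and the set of recognized keys here are illustrative, not the actual vLLM implementation):

```python
# Illustrative sketch only: a simplified stand-in for the key rename described
# above. `generation_config` stands for the dict loaded from generation_config.json.
from typing import Any


def extract_sampling_params(generation_config: dict[str, Any]) -> dict[str, Any]:
    """Pull sampling-related overrides out of a generation_config.json dict."""
    params = {
        k: v
        for k, v in generation_config.items()
        if k in ("temperature", "top_p", "top_k", "max_new_tokens")
    }
    # Hugging Face calls the limit "max_new_tokens"; vLLM's sampling parameters
    # use "max_tokens", so rename the key before handing it to the server.
    if "max_new_tokens" in params:
        params["max_tokens"] = params.pop("max_new_tokens")
    return params


# Example: {"temperature": 0.6, "max_new_tokens": 512}
#       -> {"temperature": 0.6, "max_tokens": 512}
```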
Previously the default_max_tokens was (max_model_len - prompt_tokens); now server_max_tokens = min(max_model_len - prompt_tokens, max_tokens if set in generation_config.json).
server_max_tokens is the minimum of the architectural limit, which was the old default_max_tokens, and the max_new_tokens set in generation_config.json.
These are ints, so that wasn't good. I could have gone with 2**64 or something, but didn't like the idea of some hardcoded value, so I changed the logic just a touch.
Also added setting server_max_tokens to the minimum of (context window - prompt tokens) and the max_new_tokens value set in generation_config.json.
server_max_tokens is set either by the architectural limit (context_window - prompt_tokens) or by the max_new_tokens value set in generation_config.json, whichever is smaller.
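As a rough illustration of the logic change discussed above (function and argument names are hypothetical, assuming an unset limit arrives as None rather than a hardcoded sentinel):

```python
# Hypothetical sketch of the server-side cap, assuming max_new_tokens may be
# absent (None) in generation_config.json; this is not the exact PR diff.
from typing import Optional


def compute_server_max_tokens(
    max_model_len: int,
    prompt_tokens: int,
    config_max_new_tokens: Optional[int],
) -> int:
    """Cap generated tokens by the context window and, if set, max_new_tokens."""
    # Architectural limit: what still fits in the context window.
    default_max_tokens = max_model_len - prompt_tokens
    # Rather than treating "unset" as a huge sentinel like 2**64, only include
    # the generation_config limit in the min() when it is actually present.
    limits = [default_max_tokens]
    if config_max_new_tokens is not None:
        limits.append(config_max_new_tokens)
    return min(limits)


# e.g. max_model_len=131072, prompt_tokens=1000, config limit 512 -> 512
# e.g. same prompt but no limit in generation_config.json       -> 130072
```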
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This PR adds the ability to specify the `max_new_tokens` entry in the generation_config.json file. Setting this parameter acts as the server's maximum number of generated tokens for any given request.

If a user does not specify `max_tokens` in their request, then the minimum of `max_new_tokens` and (`max_model_len` - prompt_tokens) will be used. Current behavior just uses `max_model_len` - prompt_tokens as the default. For large context window models, e.g., Llama-3.1, this is 128K.

If a user does specify `max_tokens`, then the minimum of `max_tokens`, `max_new_tokens`, and `max_model_len` - prompt_tokens will be used. Current behavior allows the user to specify a `max_tokens` larger than is physically allowed by the context window, though doing so throws an error. With this change, the server would quietly override the user's requested `max_tokens` instead.

Still need to add some testing, but would appreciate some pointers to where the tests should be added. @DarkLight1337 suggested it may not be necessary to update `vllm/entrypoints/llm.py`, but I've left that in for now until a final determination can be made.

FIX #11976