
[Frontend] Set server's maximum number of generated tokens using generation_config.json #12242

Open
mhendrey wants to merge 9 commits into base: main
Conversation

@mhendrey commented Jan 21, 2025

This PR adds the ability to specify a max_new_tokens entry in the generation_config.json file. Setting this parameter acts as the server's maximum number of generated tokens for any given request.

If a user does not specify max_tokens in their request, then the minimum of max_new_tokens and (max_model_len - prompt_tokens) is used. Current behavior just uses max_model_len - prompt_tokens as the default; for large context window models, e.g. Llama-3.1, this is 128K.

If a user does specify max_tokens, then the minimum of max_tokens, max_new_tokens, and (max_model_len - prompt_tokens) is used. Current behavior allows the user to request a max_tokens larger than the context window physically allows, which throws an error. With this change, the server quietly caps the user's requested max_tokens instead.
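
For illustration, a minimal sketch of the clamping rule described above (the helper name resolve_max_tokens and its signature are hypothetical, not the actual vLLM code):

```python
# Hypothetical helper, not the actual vLLM implementation: the effective
# limit is the smallest of the user's request, the generation_config.json
# max_new_tokens, and what the context window physically allows.
from typing import Optional


def resolve_max_tokens(
    max_model_len: int,
    prompt_tokens: int,
    requested_max_tokens: Optional[int] = None,
    config_max_new_tokens: Optional[int] = None,
) -> int:
    # Architectural limit: tokens left in the context window.
    limit = max_model_len - prompt_tokens
    # Server-wide cap from generation_config.json, if present.
    if config_max_new_tokens is not None:
        limit = min(limit, config_max_new_tokens)
    # A user-requested max_tokens, if present, is quietly capped to the limit.
    if requested_max_tokens is not None:
        return min(requested_max_tokens, limit)
    return limit


# e.g. a 128K-context model with max_new_tokens=1024 in generation_config.json:
assert resolve_max_tokens(131072, 500, None, 1024) == 1024      # no user max_tokens
assert resolve_max_tokens(131072, 500, 4096, 1024) == 1024      # user request capped
assert resolve_max_tokens(131072, 130_900, None, 1024) == 172   # context window wins
```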

I still need to add some tests, but would appreciate pointers on where they should be added. @DarkLight1337 suggested it may not be necessary to update vllm/entrypoints/llm.py, but I've left that in for now until a final determination can be made.

FIX #11976

Commits:

  • ModelConfig.get_diff_sampling_params() now allows reading "max_new_tokens" if it is specified in the generation_config.json file. This follows Hugging Face's naming convention for the variable that caps the number of generated tokens. It gets renamed to "max_tokens" to follow the naming convention used by vLLM for the same functionality.
  • Previously the default_max_tokens was (max_model_len - prompt_tokens); now server_max_tokens = min(max_model_len - prompt_tokens, max_tokens if set in generation_config.json).
  • server_max_tokens is the minimum of the architectural limitation, which was the old default_max_tokens, and max_new_tokens set in generation_config.json.
  • The values are ints, so a sentinel default wasn't a good fit. I could have gone with 2**64 or something, but didn't like the idea of a hardcoded value, so I changed the logic just a touch.
  • Also added setting server_max_tokens to the minimum of (context window - prompt) and the max_new_tokens value set in generation_config.json.
  • server_max_tokens is set either by architectural limits (context_window - prompt_tokens) or by the max_new_tokens value set in generation_config.json.
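
Taken together, a rough sketch of the lookup-and-rename behavior described above (illustrative only; the real ModelConfig.get_diff_sampling_params() in vllm/config.py is more involved, and the available_params list here is an assumption):

```python
# Illustrative sketch only -- not the actual ModelConfig code. It shows the
# idea of reading the Hugging Face-style "max_new_tokens" key from
# generation_config.json and exposing it under vLLM's "max_tokens" name.
import json
from typing import Any, Dict


def get_diff_sampling_params(generation_config_path: str) -> Dict[str, Any]:
    with open(generation_config_path) as f:
        config = json.load(f)

    # Keys the server is willing to pass through as sampling defaults
    # (assumed list for illustration).
    available_params = ["temperature", "top_p", "top_k", "max_new_tokens"]
    diff = {k: v for k, v in config.items() if k in available_params}

    # Hugging Face calls it max_new_tokens; vLLM calls it max_tokens.
    if "max_new_tokens" in diff:
        diff["max_tokens"] = diff.pop("max_new_tokens")
    return diff
```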

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Jan 21, 2025
@DarkLight1337 DarkLight1337 changed the title Enable setting server's maximum number of generated tokens using generation_config.json [Frontend] Set server's maximum number of generated tokens using generation_config.json Jan 21, 2025
@DarkLight1337
Member

You can add tests under tests/entrypoints/openai
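
For illustration, a minimal sketch of such a test (assumptions: the RemoteOpenAIServer helper from tests/utils.py, a placeholder model whose generation_config.json sets max_new_tokens, and the flag list shown; the real fixtures under tests/entrypoints/openai may differ):

```python
# Minimal sketch of a possible test under tests/entrypoints/openai.
# The model name is a placeholder, and the server args are assumptions.
from ...utils import RemoteOpenAIServer

MODEL_NAME = "some-org/model-with-max-new-tokens"  # placeholder model


def test_max_tokens_capped_by_generation_config():
    args = ["--max-model-len", "2048"]
    with RemoteOpenAIServer(MODEL_NAME, args) as server:
        client = server.get_client()
        completion = client.completions.create(
            model=MODEL_NAME,
            prompt="Hello, my name is",
            max_tokens=10_000,  # far above the server-side cap
        )
        # The request should succeed with the output quietly capped at the
        # max_new_tokens value from generation_config.json, not error out.
        assert completion.usage.completion_tokens <= 2048
```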
