[Frontend][V1] Online serving performance improvements #12287
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
These help in particular with TTFT and ITL variance. Overall throughput doesn't change much.
- Break up output processing (detokenization) to avoid blocking the event loop for too long (see the sketch below)
- Freeze the heap after startup to reduce GC overhead/pauses
- Optimize a couple of CPU hotspots seen during profiling

Signed-off-by: Nick Hill <[email protected]>
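A minimal sketch of the idea behind the first bullet, not the PR's actual implementation: process outputs in bounded chunks and yield back to the asyncio event loop between chunks. The `process_outputs_in_chunks` and `handle_output` names and the chunk size are hypothetical.

```python
import asyncio
from typing import Any, List


async def process_outputs_in_chunks(outputs: List[Any],
                                     chunk_size: int = 128) -> None:
    """Process (e.g. detokenize) engine outputs without hogging the event loop.

    Working in bounded chunks and yielding between them lets other
    coroutines (streaming responses, new requests) run, which smooths
    TTFT and reduces ITL variance.
    """
    for start in range(0, len(outputs), chunk_size):
        for output in outputs[start:start + chunk_size]:
            handle_output(output)  # hypothetical per-output processing
        # Yield control back to the asyncio event loop between chunks.
        await asyncio.sleep(0)


def handle_output(output: Any) -> None:
    # Placeholder for detokenization / response construction.
    pass
```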
Force-pushed the branch from cfc5705 to 55dd119.
@@ -42,23 +42,31 @@ class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names
    field_names: ClassVar[Optional[Set[str]]] = None
There was noticeable overhead creating this set every time one of these objects is instantiated.
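A hedged sketch of the caching pattern being discussed, assuming pydantic v2 (the validator body and the warning message are illustrative, not necessarily the PR's exact code): the set of field names and aliases is built once per class and reused on later instantiations instead of being recreated each time an object is constructed.

```python
import logging
from typing import ClassVar, Optional, Set

from pydantic import BaseModel, ConfigDict, model_validator

logger = logging.getLogger(__name__)


class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names; populated lazily on first instantiation.
    field_names: ClassVar[Optional[Set[str]]] = None

    @model_validator(mode="wrap")
    @classmethod
    def __log_extra_fields__(cls, data, handler):
        result = handler(data)
        if not isinstance(data, dict):
            return result
        field_names = cls.field_names
        if field_names is None:
            # Build the set of field names and aliases once per class
            # rather than on every instantiation.
            field_names = set()
            for name, field in cls.model_fields.items():
                field_names.add(name)
                if field.alias:
                    field_names.add(field.alias)
            cls.field_names = field_names
        extra = [k for k in data if k not in field_names]
        if extra:
            logger.warning("Fields present in the request but ignored: %s",
                           extra)
        return result
```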
    def output_token_ids(self) -> ConstantList[int]:
        # Prevent directly appending to the output_token_ids since
        # all_token_ids should also be updated simultaneously.
        return ConstantList(self._output_token_ids)
Avoid constructing these objects every time the properties are accessed.
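A hedged sketch of the fix being described (the class and attribute names are illustrative, and this minimal ConstantList stands in for vLLM's own): build the read-only views once in the constructor and return the cached instances from the properties, instead of allocating a new wrapper on every access.

```python
from __future__ import annotations

from collections.abc import Sequence


class ConstantList(Sequence):
    """Minimal read-only view over a list (stand-in for vLLM's ConstantList)."""

    def __init__(self, items: list) -> None:
        self._items = items

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self) -> int:
        return len(self._items)


class RequestState:  # hypothetical container for illustration
    def __init__(self, prompt_token_ids: list[int]) -> None:
        self._output_token_ids: list[int] = []
        self._all_token_ids: list[int] = list(prompt_token_ids)
        # Create the views once; they see later mutations of the underlying
        # lists, so they never need to be rebuilt per property access.
        self._output_token_ids_view = ConstantList(self._output_token_ids)
        self._all_token_ids_view = ConstantList(self._all_token_ids)

    @property
    def output_token_ids(self) -> ConstantList:
        # Read-only: appends must go through append_output_token_id so that
        # all_token_ids stays in sync.
        return self._output_token_ids_view

    @property
    def all_token_ids(self) -> ConstantList:
        return self._all_token_ids_view

    def append_output_token_id(self, token_id: int) -> None:
        self._output_token_ids.append(token_id)
        self._all_token_ids.append(token_id)
```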
Nice catch!
Wow, the impact on P99 ITL is crazy.
# Mark the startup heap as static so that it's ignored by GC.
# Reduces pause times of oldest generation collections.
gc.collect()
gc.freeze()
Do we need to call unfreeze at some point?
No, this is mostly static stuff that will be around for the lifetime of the process anyhow.
https://www.rippling.com/blog/the-garbage-collector-fights-back
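For reference, a small standalone sketch of this startup pattern (placing it immediately after initialization is an assumption here, not the PR's exact location): collect once after the expensive startup work, then freeze so the surviving objects are moved to the permanent generation and skipped by later collections.

```python
import gc


def main() -> None:
    # ... heavy startup work: import modules, load configs, build the engine ...
    engine = object()  # stand-in for the real server/engine objects

    # Collect garbage produced during startup, then mark every surviving
    # object as permanent so future collections don't traverse it.
    gc.collect()
    gc.freeze()
    print("objects frozen:", gc.get_freeze_count())

    # Serve requests for the lifetime of the process; gc.unfreeze() is not
    # needed because these objects live until the process exits anyway.
    _ = engine


if __name__ == "__main__":
    main()
```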
These help in particular with TTFT, ITL variance, and overall throughput.
Benchmark on A100 (before/after results not captured in this text).