[Frontend][V1] Online serving performance improvements #12287

njhill · 2025-01-21T23:38:20Z

These changes help in particular with TTFT, ITL variance, and overall throughput.

- Break up output processing (detokenization) to avoid blocking the event loop for too long
- Freeze the heap after startup to reduce GC overhead/pauses
- Optimize a couple of CPU hotspots seen during profiling
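
For readers unfamiliar with the first item, here is a minimal sketch of chunked processing on an asyncio event loop. It is not the code from this PR: `process_outputs_in_chunks`, `CHUNK_SIZE`, and the `ticker` coroutine are hypothetical names, and the chunk size is an arbitrary illustrative value. The idea is to handle a large batch in slices and yield to the loop between slices so other coroutines are not starved.

```python
import asyncio
from typing import Callable, List, Sequence, Tuple

CHUNK_SIZE = 128  # hypothetical value, not taken from the PR


async def process_outputs_in_chunks(
    outputs: Sequence[int],
    handle_one: Callable[[int], str],
    chunk_size: int = CHUNK_SIZE,
) -> List[str]:
    """Process outputs in slices, yielding to the event loop between slices."""
    results: List[str] = []
    for start in range(0, len(outputs), chunk_size):
        for item in outputs[start:start + chunk_size]:
            results.append(handle_one(item))
        # Let other coroutines run before starting the next slice.
        if start + chunk_size < len(outputs):
            await asyncio.sleep(0)
    return results


async def demo() -> Tuple[List[str], int]:
    """Run the chunked processor next to a ticker to show interleaving."""
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    results = await process_outputs_in_chunks(list(range(300)), str, 100)
    task.cancel()
    return results, ticks
```

Calling `asyncio.run(demo())` returns the processed items together with the number of times the ticker got to run while the batch was being processed. Without the `asyncio.sleep(0)` call the ticker would not run at all until the whole batch finished.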

Benchmark on A100:

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.2-1B-Instruct --disable-log-requests --port 8001 --max-num-batched-tokens 8192 --no-enable-prefix-caching --uvicorn-log-level=error

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --ignore-eos \
    --port 8001 \
    --save-result \
    --result-dir results \
    --result-filename test.json \
    --num-prompts 6000 \
    --request-rate inf \
    --max-concurrency=400

Before:

============ Serving Benchmark Result ============
Successful requests:                     6000      
Benchmark duration (s):                  94.31     
Total input tokens:                      1350511   
Total generated tokens:                  1211959   
Request throughput (req/s):              63.62     
Output token throughput (tok/s):         12850.45  
Total Token throughput (tok/s):          27169.98  
---------------Time to First Token----------------
Mean TTFT (ms):                          229.23    
Median TTFT (ms):                        158.08    
P99 TTFT (ms):                           1050.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.02     
Median TPOT (ms):                        29.64     
P99 TPOT (ms):                           68.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.77     
Median ITL (ms):                         23.19     
P99 ITL (ms):                            386.30    
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     6000      
Benchmark duration (s):                  88.60     
Total input tokens:                      1350511   
Total generated tokens:                  1211959   
Request throughput (req/s):              67.72     
Output token throughput (tok/s):         13679.34  
Total Token throughput (tok/s):          28922.50  
---------------Time to First Token----------------
Mean TTFT (ms):                          197.34    
Median TTFT (ms):                        168.03    
P99 TTFT (ms):                           1059.55   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.30     
Median TPOT (ms):                        27.75     
P99 TPOT (ms):                           47.38     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.64     
Median ITL (ms):                         24.38     
P99 ITL (ms):                            65.19     
==================================================
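
To put the two runs side by side, the relative change of a few metrics can be computed directly from the numbers above. This is only arithmetic on the reported values; `pct_change` is a helper introduced here for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the two benchmark tables above.
metrics = {
    "Request throughput (req/s)": (63.62, 67.72),
    "Mean TTFT (ms)": (229.23, 197.34),
    "P99 TPOT (ms)": (68.90, 47.38),
    "P99 ITL (ms)": (386.30, 65.19),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

By this measure request throughput rises by roughly 6% and P99 ITL falls by roughly 83%. Note that median TTFT (158.08 ms to 168.03 ms) and P99 TTFT (1050.70 ms to 1059.55 ms) are slightly higher in the "After" run.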

github-actions · 2025-01-21T23:38:30Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

These help in particular with TTFT, and ITL variance. Overall throughput doesn't change much.

- Break up output processing (detokenization) to avoid blocking the event loop for too long
- Freeze the heap after startup to reduce GC overhead/pauses
- Optimize a couple of CPU hotspots seen during profiling

Signed-off-by: Nick Hill <[email protected]>

njhill · 2025-01-22T00:21:06Z

vllm/entrypoints/openai/protocol.py

@@ -42,23 +42,31 @@ class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

+    # Cache class field names
+    field_names: ClassVar[Optional[Set[str]]] = None


There was noticeable overhead creating this set every time one of these objects is instantiated.
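
A rough illustration of the caching pattern described in the comment above, using a plain dataclass as a stand-in rather than the Pydantic model in `protocol.py`. `BaseRequest`, `ChatRequest`, `known_fields`, and `extra_fields` are hypothetical names; only the `field_names` class variable mirrors the diff.

```python
from dataclasses import dataclass, fields
from typing import Any, ClassVar, Dict, Optional, Set


@dataclass
class BaseRequest:
    """Stand-in for a request model; not vLLM's OpenAIBaseModel."""
    model: str = ""
    prompt: str = ""

    # Cached once per class instead of being rebuilt for every instance.
    field_names: ClassVar[Optional[Set[str]]] = None

    @classmethod
    def known_fields(cls) -> Set[str]:
        # Check cls.__dict__ so that each subclass builds its own cache
        # rather than inheriting the parent's set.
        if cls.__dict__.get("field_names") is None:
            cls.field_names = {f.name for f in fields(cls)}
        return cls.field_names

    @classmethod
    def extra_fields(cls, payload: Dict[str, Any]) -> Set[str]:
        """Return payload keys that are not declared fields."""
        return payload.keys() - cls.known_fields()


@dataclass
class ChatRequest(BaseRequest):
    temperature: float = 1.0
```

The set is computed on first use and then reused, so repeated calls such as `ChatRequest.extra_fields({"model": "m", "foo": 1})` only pay for the set difference.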

njhill · 2025-01-22T00:21:09Z

vllm/v1/request.py

-    def output_token_ids(self) -> ConstantList[int]:
-        # Prevent directly appending to the output_token_ids since
-        # all_token_ids should also be updated simultaneously.
-        return ConstantList(self._output_token_ids)


Avoid constructing these objects every time the properties are accessed.

Nice catch!
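
One way to avoid the per-access allocation, sketched under assumptions: build the read-only wrapper once in the constructor and hand out the same object each time. `ReadOnlyView` is a hypothetical stand-in for `ConstantList`, and this `Request` class is a simplified illustration, not the class in `vllm/v1/request.py`.

```python
from collections.abc import Sequence
from typing import List


class ReadOnlyView(Sequence):
    """Hypothetical read-only wrapper that shares storage with a list."""

    def __init__(self, data: List[int]) -> None:
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self) -> int:
        return len(self._data)


class Request:
    def __init__(self) -> None:
        self._output_token_ids: List[int] = []
        # Build the view once; it reflects later appends to the list
        # because it holds a reference, not a copy.
        self.output_token_ids = ReadOnlyView(self._output_token_ids)

    def append_output_token(self, token_id: int) -> None:
        self._output_token_ids.append(token_id)
```

This only works while `_output_token_ids` is mutated in place. Rebinding it to a new list would leave the cached view pointing at the old one.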


robertgshaw2-redhat · 2025-01-22T03:05:35Z

Wow, the impact on P99 ITL is crazy.

robertgshaw2-redhat · 2025-01-22T03:07:13Z

vllm/entrypoints/openai/api_server.py

+        # Mark the startup heap as static so that it's ignored by GC.
+        # Reduces pause times of oldest generation collections.
+        gc.collect()
+        gc.freeze()


Do we need to call unfreeze at some point?

No, this is mostly static stuff that will be around for the lifetime of the process anyhow.

https://www.rippling.com/blog/the-garbage-collector-fights-back
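
A small standalone sketch of the same calls outside the server, for experimenting with the effect. `freeze_startup_heap` is a name chosen here for illustration; `gc.collect`, `gc.freeze`, `gc.get_freeze_count`, and `gc.unfreeze` are standard library functions (Python 3.7+).

```python
import gc


def freeze_startup_heap() -> int:
    """Collect garbage, then move surviving objects to the permanent generation.

    Returns the number of objects now ignored by future collections.
    """
    # Collect first so that garbage is not frozen along with live objects.
    gc.collect()
    gc.freeze()
    return gc.get_freeze_count()


if __name__ == "__main__":
    frozen = freeze_startup_heap()
    print(f"objects in permanent generation: {frozen}")
```

`gc.unfreeze()` moves the objects back into the oldest generation. As noted in the reply above, that is not needed when the frozen objects live for the whole process.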

njhill requested review from WoosukKwon, robertgshaw2-redhat, ywang96, comaniac and alexm-redhat as code owners January 21, 2025 23:38

mergify bot added the frontend label Jan 21, 2025

njhill force-pushed the v1-perf-smoothing branch from cfc5705 to 55dd119 Compare January 21, 2025 23:39

Parallelize output socket IO on client side

0e92b61

Signed-off-by: Nick Hill <[email protected]>
