[DO NOT MERGE] Upstream codebase diff #470

Draft: wants to merge 1,504 commits into base: main

Changes from all commits (1,504 commits)
e1a5c2f
[Model] Whisper model implementation (#11280)
aurickq Jan 3, 2025
80c751e
[V1] Simplify Shutdown (#11659)
robertgshaw2-redhat Jan 3, 2025
61fed92
[Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708)
zinccat Jan 3, 2025
1543914
[V1] Improve TP>1 Error Handling + Stack Trace (#11721)
robertgshaw2-redhat Jan 3, 2025
a655eb3
[Misc]Add BNB quantization for Qwen2VL (#11719)
jeejeelee Jan 3, 2025
bf0d97d
Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695)
mgoin Jan 3, 2025
ad0d567
[V1] Chore: cruft removal (#11724)
robertgshaw2-redhat Jan 3, 2025
e5d7ed0
[V1] log GPU blocks num for MultiprocExecutor (#11656)
WangErXiao Jan 4, 2025
9c93636
Update tool_calling.md (#11701)
Bryce1010 Jan 4, 2025
d1d4939
Update bnb.md with example for OpenAI (#11718)
bet0x Jan 4, 2025
fbf2564
[V1] Add `RayExecutor` support for `AsyncLLM` (api server) (#11712)
jikunshang Jan 4, 2025
d91457d
[V1] Add kv cache utils tests. (#11513)
xcnick Jan 4, 2025
300acb8
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-…
yanburman Jan 4, 2025
eed11eb
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-On…
DarkLight1337 Jan 4, 2025
ba214df
[Bugfix] Fix precision error in LLaVA-NeXT (#11735)
DarkLight1337 Jan 4, 2025
65c0892
[Model] Remove unnecessary weight initialization logic (#11736)
DarkLight1337 Jan 4, 2025
4783143
[Bugfix][V1] Fix test_kv_cache_utils.py (#11738)
jeejeelee Jan 4, 2025
4068f4b
[MISC] Replace c10::optional with std::optional (#11730)
houseroad Jan 5, 2025
635b897
[distributed] remove pynccl's redundant stream (#11744)
cennn Jan 5, 2025
eba1717
fix: [doc] fix typo (#11751)
RuixiangMa Jan 5, 2025
33fc1e2
[Frontend] Improve `StreamingResponse` Exception Handling (#11752)
robertgshaw2-redhat Jan 5, 2025
9e764e7
[distributed] remove pynccl's redundant change_state (#11749)
cennn Jan 6, 2025
402d378
[Doc] [1/N] Reorganize Getting Started section (#11645)
DarkLight1337 Jan 6, 2025
408e560
[Bugfix] Remove block size constraint (#11723)
comaniac Jan 6, 2025
06bfb51
[V1] Add BlockTable class (#11693)
WoosukKwon Jan 6, 2025
f8fcca1
[Misc] Fix typo for valid_tool_parses (#11753)
ruisearch42 Jan 6, 2025
022c5c6
[V1] Refactor get_executor_cls (#11754)
ruisearch42 Jan 6, 2025
9c74971
[mypy] Forward pass function type hints in lora (#11740)
lucas-tucker Jan 6, 2025
2a622d7
k8s-config: Update the secret to use stringData (#11679)
surajssd Jan 6, 2025
996357e
[VLM] Separate out profiling-related logic (#11746)
DarkLight1337 Jan 6, 2025
ee77fdb
[Doc][2/N] Reorganize Models and Usage sections (#11755)
DarkLight1337 Jan 6, 2025
9279b9f
[Bugfix] Fix max image size for LLaVA-Onevision (#11769)
ywang96 Jan 6, 2025
4ca5d40
[doc] explain how to add interleaving sliding window support (#11771)
youkaichao Jan 6, 2025
32c9eff
[Bugfix][V1] Fix molmo text-only inputs (#11676)
jeejeelee Jan 6, 2025
e20c92b
[Kernel] Move attn_type to Attention.__init__() (#11690)
heheda12345 Jan 6, 2025
91b361a
[V1] Extend beyond image modality and support mixed-modality inferenc…
ywang96 Jan 6, 2025
08fb75c
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772)
DarkLight1337 Jan 7, 2025
d0169e1
[Model] Future-proof Qwen2-Audio multi-modal processor (#11776)
DarkLight1337 Jan 7, 2025
d93d2d7
[XPU] Make pp group initilized for pipeline-parallelism (#11648)
ys950902 Jan 7, 2025
8ceffbf
[Doc][3/N] Reorganize Serving section (#11766)
DarkLight1337 Jan 7, 2025
b278557
[Kernel][LoRA]Punica prefill kernels fusion (#11234)
jeejeelee Jan 7, 2025
0f3f3c8
[Bugfix] Update attention interface in `Whisper` (#11784)
ywang96 Jan 7, 2025
898cdf0
[CI] Fix neuron CI and run offline tests (#11779)
liangfu Jan 7, 2025
e512f76
fix init error for MessageQueue when n_local_reader is zero (#11768)
XiaobingSuper Jan 7, 2025
ce1917f
[Doc] Create a vulnerability management team (#9925)
russellb Jan 7, 2025
1e4ce29
[CI][CPU] adding build number to docker image name (#11788)
zhouyuan Jan 7, 2025
2d24be7
[BUG fix] Rebase caused spec decode fix (#613)
xuechendi Jan 7, 2025
27a22ab
fix slow sampling when repetition_penalty is set. (#584)
ccrhx4 Jan 7, 2025
9d6917f
Optimize for topk=1 case if we do not handle duplicates (#603)
ssarkar2 Jan 7, 2025
8082ad7
[V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (#11798)
ywang96 Jan 7, 2025
8f37be3
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calcula…
DarkLight1337 Jan 7, 2025
869e829
[doc] add doc to explain how to use uv (#11773)
youkaichao Jan 7, 2025
5d582b5
[bugfix] fix RuntimeError on apc (#648)
kkimmk Jan 7, 2025
2de197b
[V1] Support audio language models on V1 (#11733)
ywang96 Jan 7, 2025
d9fa1c0
[doc] update how pip can install nightly wheels (#11806)
youkaichao Jan 7, 2025
c0efe92
[Doc] Add note to `gte-Qwen2` models (#11808)
DarkLight1337 Jan 7, 2025
869579a
[optimization] remove python function call for custom op (#11750)
youkaichao Jan 7, 2025
c994223
[Bugfix] update the prefix for qwen2 (#11795)
jiangjiadi Jan 7, 2025
973f5dc
[Doc]Add documentation for using EAGLE in vLLM (#11417)
sroy745 Jan 7, 2025
a4e2b26
[Bugfix] Significant performance drop on CPUs with --num-scheduler-st…
DamonFool Jan 8, 2025
5950f55
[Doc] Group examples into categories (#11782)
hmellor Jan 8, 2025
91445c7
[Bugfix] Fix image input for Pixtral-HF (#11741)
DarkLight1337 Jan 8, 2025
4d29e91
[Misc] sort torch profiler table by kernel timing (#11813)
divakar-amd Jan 8, 2025
dc71af0
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange…
WangErXiao Jan 8, 2025
b640b19
Fixed docker build for ppc64le (#11518)
npanpaliya Jan 8, 2025
f4923cb
[OpenVINO] Fixed Docker.openvino build (#11732)
ilya-lavrenov Jan 8, 2025
f645eb6
[Bugfix] Add checks for LoRA and CPU offload (#11810)
jeejeelee Jan 8, 2025
259abd8
[Docs] reorganize sponsorship page (#11639)
simon-mo Jan 8, 2025
ef68eb2
[Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used…
DarkLight1337 Jan 8, 2025
889e662
[misc] improve memory profiling (#11809)
youkaichao Jan 8, 2025
ad9f1aa
[doc] update wheels url (#11830)
youkaichao Jan 8, 2025
a1b2b86
[Docs] Update sponsor name: 'Novita' to 'Novita AI' (#11833)
simon-mo Jan 8, 2025
cfd3219
[Hardware][Apple] Native support for macOS Apple Silicon (#11696)
wallashss Jan 8, 2025
f121411
[torch.compile] consider relevant code in compilation cache (#11614)
youkaichao Jan 8, 2025
2a0596b
[VLM] Reorganize profiling/processing-related code (#11812)
DarkLight1337 Jan 8, 2025
585ca9a
Add llava support to benchmark_throuhput (#665)
adobrzyniewicz-habana Jan 8, 2025
8f53dee
Add mllama support to benchmark_throughput (#668)
kdamaszk Jan 8, 2025
aba8d6e
[Doc] Move examples into categories (#11840)
hmellor Jan 8, 2025
6cd40a5
[Doc][4/N] Reorganize API Reference (#11843)
DarkLight1337 Jan 8, 2025
2f70249
[CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
bigPYJ1151 Jan 8, 2025
49a11e2
Add mark_step for encoder layers (#669)
yma11 Jan 8, 2025
78f4590
[Bugfix][XPU] fix silu_and_mul (#11823)
yma11 Jan 8, 2025
cccf363
Use FusedSDPA for MllamaVisionSdpaAttention (#620)
kdamaszk Jan 8, 2025
ca47e17
[Misc] Move some model utils into vision file (#11848)
DarkLight1337 Jan 8, 2025
fa9dbf2
Limit number of dummy cross attention blocks (#667)
kdamaszk Jan 8, 2025
5984499
[Doc] Expand Multimodal API Reference (#11852)
DarkLight1337 Jan 8, 2025
47de882
[Misc]add some explanations for BlockHashType (#11847)
WangErXiao Jan 8, 2025
56fe4c2
[TPU][Quantization] TPU `W8A8` (#11785)
robertgshaw2-redhat Jan 8, 2025
526de82
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup f…
rasmith Jan 8, 2025
3db0caf
[Docs] Add Google Cloud Meetup (#11864)
simon-mo Jan 8, 2025
615e4a5
[CI] Turn on basic correctness tests for V1 (#10864)
tlrmchlsmth Jan 9, 2025
1fe554b
treat do_lower_case in the same way as the sentence-transformers libr…
maxdebayser Jan 9, 2025
730e959
[Doc] Recommend uv and python 3.12 for quickstart guide (#11849)
mgoin Jan 9, 2025
d848800
[Misc] Move `print_*_once` from utils to logger (#11298)
DarkLight1337 Jan 9, 2025
a732900
[Doc] Intended links Python multiprocessing library (#11878)
guspan-tanadi Jan 9, 2025
310aca8
[perf]fix current stream (#11870)
youkaichao Jan 9, 2025
cbfb022
send placeholder_index_maps
adobrzyniewicz-habana Jan 9, 2025
0bd1ff4
[Bugfix] Override dunder methods of placeholder modules (#11882)
DarkLight1337 Jan 9, 2025
1d967ac
[Bugfix] fix beam search input errors and latency benchmark script (#…
yeqcharlotte Jan 9, 2025
65097ca
[Doc] Add model development API Reference (#11884)
DarkLight1337 Jan 9, 2025
73aaf71
[SW-197036] - use torch._scaled_mm with hpu (#660)
nirda7 Jan 9, 2025
405eb8e
[platform] Allow platform specify attention backend (#11609)
wangxiyuan Jan 9, 2025
bd82872
[ci]try to fix flaky multi-step tests (#11894)
youkaichao Jan 9, 2025
9a22834
[Misc] Provide correct Pixtral-HF chat template (#11891)
DarkLight1337 Jan 9, 2025
36f5303
[Docs] Add Modal to deployment frameworks (#11907)
charlesfrye Jan 9, 2025
c3cf54d
[Doc][5/N] Move Community and API Reference to the bottom (#11896)
DarkLight1337 Jan 10, 2025
b844b99
[VLM] Enable tokenized inputs for merged multi-modal processor (#11900)
DarkLight1337 Jan 10, 2025
3de2b1e
[Doc] Show default pooling method in a table (#11904)
DarkLight1337 Jan 10, 2025
cf5f000
[torch.compile] Hide KV cache behind torch.compile boundary (#11677)
heheda12345 Jan 10, 2025
ac2f3f7
[Bugfix] Validate lora adapters to avoid crashing server (#11727)
joerunde Jan 10, 2025
61af633
[BUGFIX] Fix `UnspecifiedPlatform` package name (#11916)
jikunshang Jan 10, 2025
d53575a
[ci] fix gh200 tests (#11919)
youkaichao Jan 10, 2025
d907be7
[misc] remove python function call for custom activation op (#11885)
cennn Jan 10, 2025
ef725fe
[platform] support pytorch custom op pluggable (#11328)
wangxiyuan Jan 10, 2025
d85c47d
Replace "online inference" with "online serving" (#11923)
hmellor Jan 10, 2025
241ad7b
[ci] Fix sampler tests (#11922)
youkaichao Jan 10, 2025
12664dd
[Doc] [1/N] Initial guide for merged multi-modal processor (#11925)
DarkLight1337 Jan 10, 2025
e411a64
Merge remote-tracking branch 'upstream/main' into HEAD
kzawora-intel Jan 10, 2025
ab1ca6d
make the code actually run
kzawora-intel Jan 10, 2025
f3ecf00
make linters happy
kzawora-intel Jan 10, 2025
20410b2
[platform] support custom torch.compile backend key (#11318)
wangxiyuan Jan 10, 2025
482cdc4
[Doc] Rename offline inference examples (#11927)
hmellor Jan 10, 2025
f33e033
[Docs] Fix docstring in `get_ip` function (#11932)
KuntaiDu Jan 10, 2025
5959564
Doc fix in `benchmark_long_document_qa_throughput.py` (#11933)
KuntaiDu Jan 10, 2025
aa1e77a
[Hardware][CPU] Support MOE models on x86 CPU (#11831)
bigPYJ1151 Jan 10, 2025
46fa98c
[Misc] Clean up debug code in Deepseek-V3 (#11930)
Isotr0py Jan 10, 2025
8a57940
[Misc] Update benchmark_prefix_caching.py fixed example usage (#11920)
remimin Jan 10, 2025
d45cbe7
[Bugfix] Check that number of images matches number of <|image|> toke…
tjohnson31415 Jan 10, 2025
c9f09a4
[mypy] Fix mypy warnings in api_server.py (#11941)
frreiss Jan 11, 2025
899136b
[ci] fix broken distributed-tests-4-gpus (#11937)
youkaichao Jan 11, 2025
2118d05
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with in…
llsj14 Jan 11, 2025
c32a7c7
[Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
shaochangxu Jan 11, 2025
7a3a83e
[CI/Build] Move model-specific multi-modal processing tests (#11934)
DarkLight1337 Jan 11, 2025
c5975f8
Handle LoRA specific changes in MSS (#675)
SanjuCSudhakaran Jan 11, 2025
a991f7d
[Doc] Basic guide for writing unit tests for new models (#11951)
DarkLight1337 Jan 11, 2025
d697dc0
[Bugfix] Fix RobertaModel loading (#11940)
NickLucche Jan 11, 2025
4b657d3
[Model] Add cogagent model support vLLM (#11742)
sixsixcoder Jan 11, 2025
b25cfab
[V1] Avoid sending text prompt to core engine (#11963)
ywang96 Jan 12, 2025
43f3d9e
[CI/Build] Add markdown linter (#11857)
rafvasq Jan 12, 2025
f967e51
[Model] Initialize support for Deepseek-VL2 models (#11578)
Isotr0py Jan 12, 2025
c83289e
[SW-201504] Trigger Internal Tests (#538)
RonBenMosheHabana Jan 12, 2025
8bddb73
[Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Akshat-Tripathi Jan 12, 2025
263a870
[Hardware][TPU] workaround fix for MoE on TPU (#11764)
avshalomman Jan 12, 2025
9597a09
[V1][Core][1/n] Logging and Metrics (#11962)
robertgshaw2-redhat Jan 12, 2025
d14e98d
[Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
Isotr0py Jan 13, 2025
619ae26
[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
robertgshaw2-redhat Jan 13, 2025
f7b3ba8
[MISC] fix typo in kv transfer send recv test (#11983)
yyccli Jan 13, 2025
9dd02d8
[Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
liaoyanqing666 Jan 13, 2025
80ea3af
[CI][Spec Decode] fix: broken test for EAGLE model (#11972)
llsj14 Jan 13, 2025
cf6bbcb
[Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Concurrensee Jan 13, 2025
c3f05b0
[Misc]Minor Changes about Worker (#11555)
noemotiovon Jan 13, 2025
89ce62a
[platform] add ray_device_key (#11948)
youkaichao Jan 13, 2025
c245ef0
Fix model OOM issue in llama-405 and mixtral - 2nd attempt (#644)
afierka-intel Jan 13, 2025
5340a30
Fix Max Token ID for Qwen-VL-Chat (#11980)
alex-jw-brooks Jan 13, 2025
0f8cafe
[Kernel] unified_attention for Attention.forward (#11967)
heheda12345 Jan 13, 2025
cd82499
[Doc][V1] Update model implementation guide for V1 support (#11998)
ywang96 Jan 13, 2025
e8c23ff
[Doc] Organise installation documentation into categories and tabs (#…
hmellor Jan 13, 2025
458e63a
[platform] add device_control env var (#12009)
youkaichao Jan 13, 2025
a7d5968
[Platform] Move get_punica_wrapper() function to Platform (#11516)
shen-shanshan Jan 13, 2025
eb0d42f
Add inc fp8 qunatization documentation (#635)
nirda7 Jan 13, 2025
c6db213
bugfix: Fix signature mismatch in benchmark's `get_tokenizer` functio…
e1ijah1 Jan 13, 2025
289b519
[Doc] Fix build from source and installation link in README.md (#12013)
Yikun Jan 13, 2025
f35ec46
[Bugfix] Fix deepseekv3 gate bias error (#12002)
SunflowerAries Jan 13, 2025
1a40125
[Docs] Add Sky Computing Lab to project intro (#12019)
WoosukKwon Jan 14, 2025
078da31
[HPU][Bugfix] set_forward_context and CI test execution (#12014)
kzawora-intel Jan 14, 2025
8a1f938
[Doc] Update Quantization Hardware Support Documentation (#12025)
tjtanaa Jan 14, 2025
f6b6092
Adds LoRA tests to vLLM CI pipeline (#680)
rsshaik1 Jan 14, 2025
132d40e
Update CODEOWNERS (#683)
michalkuligowski Jan 14, 2025
ff39141
[HPU][misc] add comments for explanation (#12034)
youkaichao Jan 14, 2025
bb354e6
[Bugfix] Fix various bugs in multi-modal processor (#12031)
DarkLight1337 Jan 14, 2025
1f18adb
[Kernel] Revert the API change of Attention.forward (#12038)
heheda12345 Jan 14, 2025
2e0e017
[Platform] Add output for Attention Backend (#11981)
wangxiyuan Jan 14, 2025
f51e265
Merge remote-tracking branch 'upstream/main' into private/kzawora/jan…
kzawora-intel Jan 14, 2025
a2d2acb
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
heheda12345 Jan 14, 2025
c9d6ff5
Explain where the engine args go when using Docker (#12041)
hmellor Jan 14, 2025
ca8cb82
Merge remote-tracking branch 'origin/habana_main' into private/kzawor…
kzawora-intel Jan 14, 2025
7d13823
linter updates + bugfixes
kzawora-intel Jan 14, 2025
87054a5
[Doc]: Update the Json Example of the `Engine Arguments` document (#1…
maang-h Jan 14, 2025
a3a3ee4
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_…
jeejeelee Jan 14, 2025
42f5e7c
[Kernel] Support MulAndSilu (#11624)
jeejeelee Jan 15, 2025
1a51b9f
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…
kzawora-intel Jan 15, 2025
9ddac56
[Platform] move current_memory_usage() into platform (#11369)
shen-shanshan Jan 15, 2025
b7ee940
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
WoosukKwon Jan 15, 2025
0794e74
[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)
elfiegg Jan 15, 2025
f218f9c
[core] Turn off GPU communication overlap for Ray executor (#12051)
ruisearch42 Jan 15, 2025
ad34c0d
[core] platform agnostic executor via collective_rpc (#11256)
youkaichao Jan 15, 2025
3f9b7ab
[Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
kylesayrs Jan 15, 2025
994fc65
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…
heheda12345 Jan 15, 2025
cbe9439
Fix: cases with empty sparsity config (#12057)
rahul-tuli Jan 15, 2025
ad388d2
Type-fix: make execute_model output type optional (#12020)
youngkent Jan 15, 2025
3adf0ff
[Platform] Do not raise error if _Backend is not found (#12023)
wangxiyuan Jan 15, 2025
97eb97b
[Model]: Support internlm3 (#12037)
RunningLeon Jan 15, 2025
5ecf3e0
Misc: allow to use proxy in `HTTPConnection` (#12042)
zhouyuan Jan 15, 2025
885c60d
Set vllm-hpu-extension to 6ac93fb (#684)
mfylcek Jan 15, 2025
aeebe54
Set cache size for t.compile even if there is no warmup (#689)
anko-intel Jan 15, 2025
de0526f
[Misc][Quark] Upstream Quark format to VLLM (#10765)
kewang-xlnx Jan 15, 2025
57e729e
[Doc]: Update `OpenAI-Compatible Server` documents (#12082)
maang-h Jan 15, 2025
edce722
[Bugfix] use right truncation for non-generative tasks (#12050)
joerunde Jan 15, 2025
47391dc
Jan 10 rebase (#677)
kzawora-intel Jan 15, 2025
70755e8
[V1][Core] Autotune encoder cache budget (#11895)
ywang96 Jan 15, 2025
ebd8c66
[Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
varun-sundar-rabindranath Jan 15, 2025
cd9d06f
Allow hip sources to be directly included when compiling for rocm. (#…
tvirolai-amd Jan 15, 2025
fa0050d
[Core] Default to using per_token quantization for fp8 when cutlass i…
elfiegg Jan 16, 2025
9af82cd
Workaround to handle multi-card stall issue (#688)
SanjuCSudhakaran Jan 16, 2025
f8ef146
[Doc] Add documentation for specifying model architecture (#12105)
DarkLight1337 Jan 16, 2025
567f7e7
Merge branch 'habana_main' into adobrzyniewicz/multimodality_for_llava
adobrzyniewicz-habana Jan 16, 2025
9aa1519
Various cosmetic/comment fixes (#12089)
mgoin Jan 16, 2025
dd7c9ad
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12…
Isotr0py Jan 16, 2025
40bb71f
Fix weights load device use (#686)
nirda7 Jan 16, 2025
aaaac6c
format
adobrzyniewicz-habana Jan 16, 2025
a3197c6
Merge branch 'habana_main' into adobrzyniewicz/multimodality_for_llava
adobrzyniewicz-habana Jan 16, 2025
bf53e0c
Support torchrun and SPMD-style offline inference (#12071)
youkaichao Jan 16, 2025
92e793d
[core] LLM.collective_rpc interface and RLHF example (#12084)
youkaichao Jan 16, 2025
b3a0db2
Move scores to float32 in case of running xgrammar on cpu (#695)
madamczykhabana Jan 16, 2025
874f7c2
[Bugfix] Fix max image feature size for Llava-one-vision (#12104)
ywang96 Jan 16, 2025
5fd24ec
[misc] Add LoRA kernel micro benchmarks (#11579)
varun-sundar-rabindranath Jan 16, 2025
62b06ba
[Model] Add support for deepseek-vl2-tiny model (#12068)
Isotr0py Jan 16, 2025
d06e824
[Bugfix] Set enforce_eager automatically for mllama (#12127)
heheda12345 Jan 16, 2025
ebc73f2
[Bugfix] Fix a path bug in disaggregated prefill example script. (#12…
KuntaiDu Jan 17, 2025
4db525d
Clean-up LoRA flow (#518)
SanjuCSudhakaran Jan 17, 2025
fead53b
[CI]add genai-perf benchmark in nightly benchmark (#10704)
jikunshang Jan 17, 2025
1475847
[Doc] Add instructions on using Podman when SELinux is active (#12136)
terrytangyuan Jan 17, 2025
b8bfa46
[Bugfix] Fix issues in CPU build Dockerfile (#12135)
terrytangyuan Jan 17, 2025
d1adb9b
[BugFix] add more `is not None` check in VllmConfig.__post_init__ (#1…
heheda12345 Jan 17, 2025
d75ab55
[Misc] Add deepseek_vl2 chat template (#12143)
Isotr0py Jan 17, 2025
8027a72
[ROCm][MoE] moe tuning support for rocm (#12049)
divakar-amd Jan 17, 2025
69d765f
[V1] Move more control of kv cache initialization from model_executor…
heheda12345 Jan 17, 2025
2d85682
Merge branch 'habana_main' into adobrzyniewicz/multimodality_for_llava
adobrzyniewicz-habana Jan 17, 2025
a685225
Check if kv_cache is tuple before calling split_kv_cache (#697)
kdamaszk Jan 17, 2025
a293e2e
Merge branch 'habana_main' into adobrzyniewicz/multimodality_for_llava
adobrzyniewicz-habana Jan 17, 2025
07934cc
[Misc][LoRA] Improve the readability of LoRA error messages (#12102)
jeejeelee Jan 17, 2025
d4e6194
[CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
bigPYJ1151 Jan 17, 2025
87a0c07
[core] allow callable in collective_rpc (#12151)
youkaichao Jan 17, 2025
7eea2df
[CI] Cleanup run_tests.sh logs (#700)
kzawora-intel Jan 17, 2025
ce50b1a
Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…
kzawora-intel Jan 17, 2025
a128878
fix TP crashes
kzawora-intel Jan 17, 2025
2e53e75
make mypy happy
kzawora-intel Jan 17, 2025
21f5fb2
¿what the heck is incquark?
kzawora-intel Jan 17, 2025
f1e911d
i forgot brackets again
kzawora-intel Jan 17, 2025
ae67e4d
Multimodality fix for llava (#641)
adobrzyniewicz-habana Jan 17, 2025
018ce62
Rebase 2025-01-17 (#701)
kzawora-intel Jan 17, 2025
b10992b
Fix LoRA tests (#696)
SanjuCSudhakaran Jan 20, 2025
1252646
Updating README_GAUDI in habana_main (#690)
MohitIntel Jan 20, 2025
293bd87
Change vllm-hpu-extension revision to ae726d4
iboiko-habana Jan 20, 2025
cc069cb
Change vllm-hpu-extension revision to ae726d4 (#707)
iboiko-habana Jan 20, 2025
fedf706
Capabilities overhaul (#692)
madamczykhabana Jan 20, 2025
37eb4fc
[SW-216156] Fix mixtral Fused MoE issues after rebase (#708)
dudilester Jan 21, 2025
1df1c2c
Disable enforcing eager mode for mllama and deepseek_v3 on hpu (#713)
jkaniecki Jan 21, 2025
e977f2a
Fix for random sampler recompilations for incomplete batches (#663)
mfylcek Jan 22, 2025
a64571c
[SW-216413] - Fix new executors shutdown and shutdown_inc flow (#716)
nirda7 Jan 22, 2025
24 changes: 24 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,24 @@
import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
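For context, a minimal sketch of the anchor this new script emits, assuming a hypothetical wheel path (only the '+' escaping behaviour is taken from the script above):

# Sketch of the '+' escaping performed by .buildkite/generate_index.py.
# The wheel path below is a made-up example, not taken from this PR.
import os

wheel = "dist/vllm-0.6.6+cu124-cp38-abi3-manylinux1_x86_64.whl"  # hypothetical
filename = os.path.basename(wheel)

# CloudFront requires escaping '+' in the href; the link text keeps the raw name.
href = "../" + filename.replace("+", "%2B")
print(f'<a href="{href}">{filename}</a><br/>')
# -> <a href="../vllm-0.6.6%2Bcu124-cp38-abi3-manylinux1_x86_64.whl">vllm-0.6.6+cu124-cp38-abi3-manylinux1_x86_64.whl</a><br/>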
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
68 changes: 50 additions & 18 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -1,5 +1,6 @@
steps:
- label: "Wait for container to be ready"
key: wait-for-container-image
agents:
queue: A100
plugins:
@@ -9,16 +10,18 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +44,49 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
    if df.empty:
        continue

    # Sort all dataframes by their respective "Test name" columns
    df.sort_values(by="Test name", inplace=True)

    # The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
    # we want to turn it into "8xGPUTYPE"
    df["GPU"] = df["GPU"].apply(
        lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
                            headers='keys',
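A small sketch of the GPU-column rewrite added in the results-to-markdown script above, using a made-up results frame; the split is done in a helper rather than inside the f-string only to avoid a backslash in the expression:

import pandas as pd

# Hypothetical serving-results frame; only the "GPU" column matters here.
df = pd.DataFrame({
    "Test name": ["serving_llama8B_tp4_sharegpt"],
    "GPU": ["A100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB\nA100-SXM4-80GB"],
})

def collapse_gpu_column(gpu_field: str) -> str:
    # "GPUTYPE\nGPUTYPE\n..." -> "4xGPUTYPE", matching the transformation above.
    parts = gpu_field.split("\n")
    return f"{len(parts)}x{parts[0]}"

df["GPU"] = df["GPU"].apply(collapse_gpu_column)
print(df["GPU"].iloc[0])  # -> 4xA100-SXM4-80GB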
63 changes: 25 additions & 38 deletions .buildkite/nightly-benchmarks/scripts/launch-server.sh
@@ -50,58 +50,54 @@ launch_trt_server() {
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
git lfs install
cd tensorrtllm_backend
git checkout $trt_llm_version
tensorrtllm_backend_dir=$(pwd)
git checkout "$trt_llm_version"
git submodule update --init --recursive

# build trtllm engine
cd /tensorrtllm_backend
cd ./tensorrt_llm/examples/${model_type}
cd "./tensorrt_llm/examples/${model_type}"
python3 convert_checkpoint.py \
--model_dir ${model_path} \
--dtype ${model_dtype} \
--tp_size ${model_tp_size} \
--output_dir ${trt_model_path}
--model_dir "${model_path}" \
--dtype "${model_dtype}" \
--tp_size "${model_tp_size}" \
--output_dir "${trt_model_path}"
trtllm-build \
--checkpoint_dir ${trt_model_path} \
--checkpoint_dir "${trt_model_path}" \
--use_fused_mlp \
--reduce_fusion disable \
--workers 8 \
--gpt_attention_plugin ${model_dtype} \
--gemm_plugin ${model_dtype} \
--tp_size ${model_tp_size} \
--max_batch_size ${max_batch_size} \
--max_input_len ${max_input_len} \
--max_seq_len ${max_seq_len} \
--max_num_tokens ${max_num_tokens} \
--output_dir ${trt_engine_path}
--gpt_attention_plugin "${model_dtype}" \
--gemm_plugin "${model_dtype}" \
--tp_size "${model_tp_size}" \
--max_batch_size "${max_batch_size}" \
--max_input_len "${max_input_len}" \
--max_seq_len "${max_seq_len}" \
--max_num_tokens "${max_num_tokens}" \
--output_dir "${trt_engine_path}"

# handle triton protobuf files and launch triton server
cd /tensorrtllm_backend
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cd triton_model_repo
rm -rf ./tensorrt_llm/1/*
cp -r ${trt_engine_path}/* ./tensorrt_llm/1
cp -r "${trt_engine_path}"/* ./tensorrt_llm/1
python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5"
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false"
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:"$max_batch_size"
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt "triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:False,bls_instance_count:1"
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py \
--world_size=${model_tp_size} \
--world_size="${model_tp_size}" \
--model_repo=/tensorrtllm_backend/triton_model_repo &

}

launch_tgi_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -129,10 +125,7 @@ launch_tgi_server() {
launch_lmdeploy_server() {
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

server_command="lmdeploy serve api_server $model \
@@ -149,10 +142,7 @@ launch_sglang_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -185,10 +175,7 @@ launch_vllm_server() {

model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params")

if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -217,19 +204,19 @@

main() {

if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "trt" ]]; then
launch_trt_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "tgi" ]]; then
launch_tgi_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
launch_lmdeploy_server
fi

if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "sglang" ]]; then
launch_sglang_server
fi

12 changes: 6 additions & 6 deletions .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -16,10 +16,10 @@ main() {
fi

# initial annotation
description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
#description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"

# download results
cd $VLLM_SOURCE_CODE_LOC/benchmarks
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
mkdir -p results/
/workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
ls
@@ -30,15 +30,15 @@ main() {
/workspace/buildkite-agent artifact upload "results.zip"

# upload benchmarking scripts
cd $VLLM_SOURCE_CODE_LOC/
cd "$VLLM_SOURCE_CODE_LOC/"
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# upload benchmarking pipeline
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"

cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md


@@ -75,4 +75,4 @@ main() {
# /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
}

main "$@"
main "$@"