-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Online video support for VLMs #10020
Online video support for VLMs #10020
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, but please add some tests to verify this.
This pull request has merge conflicts that must be resolved before it can be |
OK , I have updated the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this!
It looks like the tests failed though, PTAL. |
The tests succeeded in my local machine. |
|
Excuse me, I have a question to ask, if we want to use vllm to call the qwen2-lv model, when we want to pass the video as input, if I extract the frame and crop it on the client to get the desired video frame sequence, call the chat interface of vllm by passing the picture sequence or call the chat interface by directly passing the video and the sampling frequency and the desired size implemented in this pr, will there be any difference in the results of these two calling methods? I understand that video passing video will bring more network bandwidth pressure and latency, thanks for the answer! |
The HF processor will be called regardless of whether you have done preprocessing beforehand. I am not sure whether HF processor is intelligent enough to return early if the image has already been cropped though. |
Signed-off-by: DarkLight1337 <[email protected]>
"Apologies if I wasn’t clear enough. I wanted to ask whether I can convert the video into individual frames and then call the chat interface by passing them as multiple images using the image_url field, rather than using the video_url field introduced in this PR. Is this approach feasible?" |
Yes, Qwen2-VL supports both multi-image and video input. |
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Tests should pass now! |
Signed-off-by: DarkLight1337 <[email protected]>
Thank you for your patience. |
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]>
@litianjian Thank you very much for your work. However, currently when video_url is passed in, we cannot control the logic of video frame extraction and image resizing. As a result, we cannot finely control the output result of the video. I am wondering if it is possible to specify a video_process function at the same time when the url is passed in, similar to encode_video when processing base64. After vllm downloads the video, use this function to process the video. The reason for hoping to adopt this method instead of inputting in base64 is that transmitting base64 video occupies too much bandwidth and is prone to network congestion. |
For some models (e.g. Qwen2-VL), You can set |
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Loc Huynh <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: litianjian <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
If I use multiple images, how many images can input at most, and how can I increase this maximum value? |
Please refer to the |
Online video support for VLMs
vLLM already supports a large number of MultiModal Machine Learning visual models, some of which support image and video input,such as Qwen2-VL, LLaVA-Onevision, etc. Referring to the implementation of image, this proposal adds support for video.
Refer to the visual interfaces of OpenAI (vision and video) and Google Gemini, the visual interface should ideally support inputs from Video URLs and base64.
FIX #9842
Examples