Online video support for VLMs #10020

Merged
merged 10 commits into from
Nov 7, 2024

Conversation

litianjian
Contributor

@litianjian litianjian commented Nov 5, 2024

Online video support for VLMs

vLLM already supports many multimodal vision-language models, some of which accept both image and video input, such as Qwen2-VL and LLaVA-OneVision. This PR adds video support to online serving, following the existing image implementation.

Following the vision interfaces of OpenAI (vision and video) and Google Gemini, the interface should ideally accept video input both as URLs and as base64-encoded data.

FIX #9842

Examples

vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf --served-model-name hello --trust-remote-code
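import base64
from io import BytesIO

import numpy as np
import requests
# decord is required by encode_video below, so fail loudly if it is missing
# instead of silently swallowing the ImportError (`pip install decord`).
from decord import VideoReader, cpu
from openai import OpenAI
from PIL import Image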

openai_api_key = "123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

def encode_video(video_path, max_frames=80, is_fps_sampling=True):
    """Sample frames from a local or remote video and return them as base64 PNG strings."""
    if video_path.startswith(("http://", "https://")):
        # Download remote videos into memory so decord can read them.
        response = requests.get(video_path)
        response.raise_for_status()
        video_path = BytesIO(response.content)

    # Everything below runs for both local paths and downloaded videos.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    if is_fps_sampling:
        # FPS sampling: roughly one frame per second of video.
        avg_fps = max(1, round(vr.get_avg_fps()))  # guard against a zero step
        frame_idx = list(range(0, total_frame_num, avg_fps))
        if len(frame_idx) > max_frames:
            # Too many frames: fall back to evenly spaced indices.
            frame_idx = np.linspace(
                0, total_frame_num - 1, max_frames, dtype=int
            ).tolist()
    else:
        # Uniform sampling: keep every frame if the video is short enough.
        if total_frame_num > max_frames:
            frame_idx = np.linspace(
                0, total_frame_num - 1, max_frames, dtype=int
            ).tolist()
        else:
            frame_idx = list(range(total_frame_num))
    print(frame_idx)

    frames = vr.get_batch(frame_idx).asnumpy()
    print("actual frames", len(frames))

    # Encode each sampled frame as a PNG, then as a base64 string.
    base64_frames = []
    for frame in frames:
        img = Image.fromarray(frame)
        output_buffer = BytesIO()
        img.save(output_buffer, format="PNG")
        byte_data = output_buffer.getvalue()
        base64_frames.append(base64.b64encode(byte_data).decode("utf-8"))
    return base64_frames


video_url = "https://raw.githubusercontent.com/EvolvingLMMs-Lab/sglang/dev/onevision_local/assets/jobs.mp4"
frames = encode_video(video_url, max_frames=32, is_fps_sampling=False)
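images = list(frames)
# Join the per-frame base64 strings with commas; the result is the payload
# of the video data URL sent in the second request below.
video_base64 = ",".join(images)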

chat_response = client.chat.completions.create(
    model="hello",
    temperature=0,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the video token `<video>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "Please describe the video comprehensively as much as possible."},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)


chat_response = client.chat.completions.create(
    model="hello", 
    temperature=0,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the video token `<video>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "Please describe the video comprehensively as much as possible."},
            {"type": "video_url", "video_url": {"url": f"data:video/png;base64,{video_base64}"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)

chat_response = client.chat.completions.create(
    model="hello",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What’s in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{images[0]}"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
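
The second request in the example packs the sampled frames into a single data URL by joining the per-frame base64 strings with commas. A minimal client-side sketch for sanity-checking such a payload before sending it might look like the following. `split_video_data_url` is a hypothetical helper written for illustration, not part of vLLM, and it assumes the payload follows the comma-joined layout used in this example.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_video_data_url(data_url: str) -> list[bytes]:
    """Split a comma-joined frame data URL into raw per-frame bytes.

    Hypothetical helper: assumes the layout used in the example,
    i.e. "data:<media type>;base64,<frame0>,<frame1>,...".
    """
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError(f"unexpected data URL header: {header!r}")
    # Standard base64 never contains a comma, so it is a safe frame separator.
    return [base64.b64decode(chunk) for chunk in payload.split(",") if chunk]


# Two fake "frames" that only carry the PNG signature, for a quick round trip.
fake_frames = [PNG_SIGNATURE + bytes([i]) for i in range(2)]
encoded = ",".join(base64.b64encode(f).decode("utf-8") for f in fake_frames)
decoded = split_video_data_url(f"data:video/png;base64,{encoded}")
print(len(decoded), all(f.startswith(PNG_SIGNATURE) for f in decoded))
```

With real data, `encoded` would be the `video_base64` string from the example; checking the PNG signature of each decoded frame catches truncated or mis-joined payloads early.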


github-actions bot commented Nov 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Nov 5, 2024
Member

@DarkLight1337 DarkLight1337 left a comment

The code looks good, but please add some tests to verify this.


mergify bot commented Nov 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @litianjian please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 6, 2024
@litianjian
Contributor Author

The code looks good, but please add some tests to verify this.

OK, I have updated the tests.

@mergify mergify bot removed the needs-rebase label Nov 6, 2024
Member

@DarkLight1337 DarkLight1337 left a comment

Thanks for adding this!

@DarkLight1337
Member

It looks like the tests failed though, PTAL.

@litianjian
Contributor Author

It looks like the tests failed though, PTAL.

The tests passed on my local machine.

@DarkLight1337
Member

DarkLight1337 commented Nov 7, 2024

It looks like vllm[video] isn't being installed for the test environment. I have updated the dependencies.
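
A quick way to check whether the video dependencies are importable in a given environment is sketched below. `missing_modules` is a hypothetical helper; `decord` is listed because the client example in this PR imports it, and the assumption that the `vllm[video]` extra provides it should be verified against the installed vLLM version.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# decord is what the client example in this PR imports; adjust as needed.
missing = missing_modules(["decord"])
if missing:
    print(f"Missing {missing}; installing the vllm[video] extra may provide them.")
else:
    print("Video dependencies look importable.")
```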

@xiaoajie738

Excuse me, I have a question. Suppose we use vLLM to call the Qwen2-VL model with video input. I could either (a) extract and crop the frames on the client and call the chat interface with the resulting image sequence, or (b) call the chat interface with the video directly, along with the sampling frequency and desired size, as implemented in this PR. Will the results of these two methods differ? I understand that passing the video directly puts more pressure on network bandwidth and adds latency. Thanks for the answer!

@DarkLight1337
Member

The HF processor will be called regardless of whether you have done preprocessing beforehand. I am not sure whether the HF processor is smart enough to return early if the image has already been cropped, though.

Signed-off-by: DarkLight1337 <[email protected]>
@xiaoajie738

Apologies if I wasn’t clear enough. I wanted to ask whether I can convert the video into individual frames and then call the chat interface by passing them as multiple images using the image_url field, rather than using the video_url field introduced in this PR. Is this approach feasible?

@DarkLight1337
Member

Yes, Qwen2-VL supports both multi-image and video input.
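
As a rough sketch of the multi-image route, the frames returned by a helper such as `encode_video` can be wrapped as one `image_url` part each. The helper names below are hypothetical, and the placeholder strings stand in for real base64 PNG data.

```python
def frames_to_image_parts(base64_frames, media_type="image/png"):
    """Wrap base64-encoded frames as OpenAI-style image_url content parts."""
    return [
        {"type": "image_url",
         "image_url": {"url": f"data:{media_type};base64,{frame}"}}
        for frame in base64_frames
    ]


def build_multi_image_message(prompt, base64_frames):
    """Build one user message: a text prompt followed by every frame."""
    content = [{"type": "text", "text": prompt}]
    content.extend(frames_to_image_parts(base64_frames))
    return {"role": "user", "content": content}


# Placeholder strings stand in for real base64 PNG data.
message = build_multi_image_message("Describe these frames.", ["AAAA", "BBBB"])
print(len(message["content"]))
```

The resulting dict can be passed as `messages=[message]` to `client.chat.completions.create`, in the same way as the image request in the PR description.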

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 7, 2024
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 7, 2024 16:35
@DarkLight1337
Member

Tests should pass now!

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 DarkLight1337 merged commit 28b2877 into vllm-project:main Nov 7, 2024
69 of 70 checks passed
@litianjian
Contributor Author

Tests should pass now!

Thank you for your patience.

Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Nov 8, 2024
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
@hujh818

hujh818 commented Nov 11, 2024

@litianjian Thank you very much for your work. However, when a video_url is passed in, we currently cannot control how frames are extracted or how images are resized, so we cannot finely control the model's output for the video. Would it be possible to specify a video_process function along with the URL, similar to encode_video in the base64 path, so that vLLM applies it after downloading the video? We would prefer this over base64 input because transmitting base64-encoded video takes too much bandwidth and is prone to network congestion.
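
To put the bandwidth concern in numbers: standard base64 turns every 3 input bytes into 4 output characters, so the encoded frames are about a third larger than the raw PNG bytes. The sketch below works through that arithmetic; the frame count and per-frame size are made-up values for illustration only.

```python
import math


def base64_size(num_bytes: int) -> int:
    """Length in characters of standard padded base64 for num_bytes of input."""
    return 4 * math.ceil(num_bytes / 3)


def frames_payload_size(frame_sizes):
    """Size of comma-joined base64 frames (one separator between neighbours)."""
    if not frame_sizes:
        return 0
    return sum(base64_size(n) for n in frame_sizes) + len(frame_sizes) - 1


# Hypothetical numbers: 32 PNG frames of 300 KB each.
frame_sizes = [300_000] * 32
raw = sum(frame_sizes)
encoded = frames_payload_size(frame_sizes)
print(raw, encoded, round(encoded / raw, 3))
```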

@DarkLight1337
Member

For some models (e.g. Qwen2-VL), you can set --mm-processor-kwargs on startup to configure the HF processor class.
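
A minimal sketch of building such a startup command follows. The flag takes a JSON object; the `min_pixels` and `max_pixels` keys and their values, as well as the model name, are assumptions for illustration, since the accepted keys depend on the model's HF processor.

```python
import json
import shlex

# Assumed values for illustration; valid keys depend on the model's HF processor.
mm_processor_kwargs = {"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28}

command = [
    "vllm", "serve", "Qwen/Qwen2-VL-7B-Instruct",
    "--mm-processor-kwargs", json.dumps(mm_processor_kwargs),
]
# shlex.join quotes the JSON so it survives a POSIX shell as one argument.
print(shlex.join(command))
```

Serializing with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the braces and double quotes in the JSON argument.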

@wuxianyess

If I use multiple images, what is the maximum number of images I can pass, and how can I increase that maximum?

@DarkLight1337
Member

Please refer to the --limit-mm-per-prompt engine argument.
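
Since the limit is set on the server, it can help to count the images in a request on the client before sending it. The sketch below is a hypothetical client-side check, not part of vLLM, and the `limit` value is an assumption that has to match whatever the server was started with.

```python
def count_parts(messages, part_type):
    """Count content parts of the given type across all chat messages."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content has no parts
            total += sum(1 for part in content if part.get("type") == part_type)
    return total


def check_image_limit(messages, limit):
    """Raise early if a request carries more images than the assumed limit."""
    n = count_parts(messages, "image_url")
    if n > limit:
        raise ValueError(f"{n} images in request, but the assumed limit is {limit}")
    return n


messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Compare these."}]
    + [{"type": "image_url", "image_url": {"url": f"https://example.com/{i}.png"}}
       for i in range(3)],
}]
print(check_image_limit(messages, limit=4))
```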

Labels
ci/build documentation Improvements or additions to documentation frontend ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Online video support for VLMs
5 participants