-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[V1] VLM preprocessor hashing #11020
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
This pull request has merge conflicts that must be resolved before it can be |
545a40a
to
3554439
Compare
/ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Just some comments for style. Also please
- Revert unnecessary changes in the example code.
- Add some unit tests.
vllm/v1/engine/processor.py
Outdated
mm_hashes = self.mm_hasher.hash(decoder_inputs.multi_modal_data) \ | ||
if self.mm_hasher is not None else None | ||
|
||
mm_inputs, mm_hashes = self.mm_input_mapper_client.process_inputs( | ||
decoder_inputs.multi_modal_data, mm_hashes, | ||
decoder_inputs.mm_processor_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably need some assertions to make sure mm_hasher and mm_input_mapper_client are not None.
@comaniac thanks for the quick review! I actually realized that I need to push the hash and the cache into the _process_multimodal(..) due to recent merging of preprocessor for llava. Also it caches the first preprocessor too. Will send changes tomorrow. |
This pull request has merge conflicts that must be resolved before it can be |
0e26569
to
f4466ba
Compare
0c69c14
to
7304434
Compare
@comaniac code is ready for re-review. Eventually, I did not push the hasher/cacher into the _process_multimodal(..) since it is also used to compute the MM placeholders, and there is no simple way to skip it due to hash HIT. We can look into this as a follow up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just some nits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does these changes compatible with v0? If not then we should differentiate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it should not matter if it is v0 or v1, since it simply controls the images used.
185655c
to
6a7b37a
Compare
9ec0c15
to
431f5d4
Compare
Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Alexander Matveev <[email protected]>
431f5d4
to
adda9d4
Compare
Signed-off-by: Alexander Matveev <[email protected]>
Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Alexander Matveev <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Alexander Matveev <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Alexander Matveev <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Roger Wang <[email protected]>
This PR adds:
The idea of MM preprocessor caching is based on having a client and a server, where the client executes in the frontend process (=P0) and the server in the core process (=P1).
The caching for both client and server is mirrored/similar, and this allows us to avoid the serialization of "mm_inputs" (like pixel values) between client (=P0) and server (=P1) processes.
Currently, the MM preprocessor caching is disabled by default, since we did not finish the performance analysis yet. We still need to enable encoder caching and prefix caching end-to-end.
Follow up PRs: