[V1] Multiprocessing Tensor Parallel Support for v1 #9856

Merged
merged 68 commits into vllm-project:main from tms/v1_tp on Dec 10, 2024

Conversation

tlrmchlsmth
Collaborator

@tlrmchlsmth tlrmchlsmth commented Oct 30, 2024

Implementation of tensor parallel support for V1.

Some differences from V0:

  • The executor creates N worker processes rather than N-1 processes like in V0
  • All processes run prepare_inputs.
  • All processes run the sampler. This means that all workers need the logits, so the logits_processor performs an AllGather instead of a Gather operation.
  • The executor broadcasts the scheduler output to the workers and one of them sends back the model runner output -- both of these IPCs use shared memory message queues.
  • The workers sit in a very tight model execution loop where they only handle model execution and process termination (a simplified sketch of this broadcast/execute pattern follows this list).
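
For intuition, here is a minimal sketch of that broadcast/execute/collect pattern, using plain multiprocessing queues as a stand-in for vLLM's shared-memory message queues; run_model and the message contents are placeholders rather than the actual vLLM interfaces.

```python
import multiprocessing as mp

def run_model(scheduler_output):
    # Placeholder for the real model forward pass + sampling.
    return {"step": scheduler_output["step"], "sampled_tokens": [0]}

def worker_loop(rank, broadcast_q, output_q):
    # Tight worker loop: receive the broadcast scheduler output, execute the
    # model, and (on rank 0 only) send the model runner output back.
    while True:
        scheduler_output = broadcast_q.get()
        if scheduler_output is None:          # sentinel => terminate the process
            break
        result = run_model(scheduler_output)
        if rank == 0:
            output_q.put(result)

def executor_step(broadcast_qs, output_q, scheduler_output):
    # Executor side: broadcast one step's schedule to all N workers,
    # then wait for the single reply from rank 0.
    for q in broadcast_qs:
        q.put(scheduler_output)
    return output_q.get()

if __name__ == "__main__":
    n = 2
    broadcast_qs = [mp.Queue() for _ in range(n)]
    output_q = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(r, broadcast_qs[r], output_q))
               for r in range(n)]
    for w in workers:
        w.start()
    print(executor_step(broadcast_qs, output_q, {"step": 0}))
    for q in broadcast_qs:
        q.put(None)                           # shut the workers down
    for w in workers:
        w.join()
```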

Some benchmarks are below, using Python 3.12. I'm running Llama3-8B on 2 A100s, which will really make any overheads stand out.

===================
main
===================

python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 \
   --num-iters-warmup 5 --num-iters 20  --batch-size 8 --input-len 128 --output-len 256

Avg latency: 2.301660185540095 seconds
10% percentile latency: 2.2838743404485284 seconds
25% percentile latency: 2.2875737231224775 seconds
50% percentile latency: 2.291928739286959 seconds
75% percentile latency: 2.2938467713538557 seconds
90% percentile latency: 2.3024177022278307 seconds
99% percentile latency: 2.459673847686499 seconds

===================
This PR
===================

VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 \
   --num-iters-warmup 5 --num-iters 20  --batch-size 8 --input-len 128 --output-len 256

Avg latency: 2.293809377681464 seconds
10% percentile latency: 2.288416426535696 seconds
25% percentile latency: 2.2926345141604543 seconds
50% percentile latency: 2.293652622960508 seconds
75% percentile latency: 2.295131590683013 seconds
90% percentile latency: 2.299927053321153 seconds
99% percentile latency: 2.300706419609487 seconds


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@njhill njhill self-requested a review October 31, 2024 00:15
@WoosukKwon
Collaborator

@tlrmchlsmth Thanks for the great work! Let me know when it is ready for review.


mergify bot commented Nov 1, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @tlrmchlsmth please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@njhill
Member

njhill commented Nov 11, 2024

@tlrmchlsmth this looks like a great start and is in the right direction imo. I have been thinking a lot though about how to streamline some things in V1... going to dump some of those ideas here and we can discuss more next week.

  • I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.
  • We currently have (imo) quite a convoluted layering w.r.t. distributed communication. Initially the idea was to separate control and data planes, but for performance we ended up with a situation where the control plane is an event loop that kicks the workers into a separate loop when executing requests. In both cases they are waiting on a queue for broadcast messages, and I feel it would be simpler to have a single event loop.
  • After broadcasting, our core engine loop can perform other CPU tasks while waiting for the response from the rank 0 worker - for example serialization/deserialization of messages to/from the "front-end" process.
  • In fact it should be fairly simple (unless I'm missing something) to decouple the data flows to/from the workers and make the scheduling async, so that they always have the schedule for the next step waiting when they finish the current one. We can update and broadcast the schedule with the latest added/removed requests for the (n+2)th step soon after receiving the sampler output for the nth step (i.e. while the (n+1)th step is underway).
  • We should use msgpack/msgspec instead of pickle for all these IPC messages (see the sketch after this list).
  • I think the current MessageQueue impl (using shm and zmq) would need some adjustments to support the above, such as not using pickle. Also I don't think we would want to use shm for receiving messages in the central proc since it involves spinning. We could use it in the workers to receive the broadcasts but if we are doing async scheduling then it might actually be better to use zmq for a similar reason - and then it will be easier to overlap the I/O and serialization/deserialization with the forward pass by running it in a separate thread.
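
To make the msgpack/msgspec point concrete, here is a minimal sketch of what a pickle-free IPC message could look like; the SchedulerOutput fields below are made up for illustration and are not vLLM's actual schema.

```python
import msgspec

class SchedulerOutput(msgspec.Struct):
    # Hypothetical fields, for illustration only.
    step: int
    scheduled_request_ids: list[str]
    finished_request_ids: list[str]

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(SchedulerOutput)

msg = SchedulerOutput(step=42,
                      scheduled_request_ids=["req-1", "req-2"],
                      finished_request_ids=[])
wire_bytes = encoder.encode(msg)           # compact bytes, suitable for zmq/shm transport
roundtripped = decoder.decode(wire_bytes)  # typed decode, no pickle involved
assert roundtripped == msg
```

Unlike pickle, the decoder only constructs the declared Struct type, which avoids arbitrary code execution on deserialization and much of pickle's overhead.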

Mostly orthogonal to the above, I also feel we should rethink the executor abstraction / class hierarchy to better isolate the accelerator-specific aspects. There's a significant amount of duplicated logic right now across different executors (and even workers) and I think we could unify/consolidate a lot of that.


mergify bot commented Nov 11, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2024
@youkaichao youkaichao self-assigned this Nov 12, 2024
@mergify mergify bot removed the needs-rebase label Nov 13, 2024
@njhill njhill mentioned this pull request Nov 14, 2024
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
@tlrmchlsmth
Collaborator Author

> I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.

Do folks think I should go ahead with this in the current PR? I know @youkaichao had some concerns about ease of debugging if we always put the worker in a separate process.

Not sure I see a benefit for this implementation, beyond consolidating the implementation and reducing code size. For asynchronous scheduling, it makes a whole lot of sense.

@njhill
Member

njhill commented Nov 22, 2024

>> I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.

> Do folks think I should go ahead with this in the current PR? I know @youkaichao had some concerns about ease of debugging if we always put the worker in a separate process.

> Not sure I see a benefit for this implementation, beyond consolidating the implementation and reducing code size. For asynchronous scheduling, it makes a whole lot of sense.

@tlrmchlsmth I think it's fine to defer it.

Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>

mergify bot commented Nov 23, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 23, 2024
Signed-off-by: Tyler Michael Smith <[email protected]>
@tlrmchlsmth
Collaborator Author

Switched initialization over to use collective_rpc. Easy now that we've made it general.
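
As a rough illustration of what initialization through a generic collective_rpc can look like (a toy sketch, not the actual vLLM executor; the worker method names are placeholders):

```python
class ToyExecutor:
    # Toy stand-in: real workers live in separate processes and the RPC goes
    # over IPC; here the workers are in-process objects to keep this short.
    def __init__(self, workers):
        self.workers = workers

    def collective_rpc(self, method, *args, **kwargs):
        # Run the named method on every worker and gather the results.
        return [getattr(worker, method)(*args, **kwargs) for worker in self.workers]

    def initialize(self, num_gpu_blocks):
        # Each setup phase becomes one generic RPC instead of a bespoke call path.
        self.collective_rpc("init_device")
        self.collective_rpc("load_model")
        self.collective_rpc("initialize_cache", num_gpu_blocks)
```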

Review thread on vllm/platforms/cuda.py (outdated, resolved)
Member

@youkaichao youkaichao left a comment


Thanks for the great efforts! The PR looks good to me in general. I left several new comments, but they should be easy to address.

I do want to discuss one more thing about the initialization process tomorrow.

Review thread on vllm/platforms/cuda.py (outdated, resolved)
Member

@youkaichao youkaichao left a comment


Thanks for the great work! I don't have any major concerns now. There is some follow-up work, but we don't necessarily need to do it in this PR.

Before merging, please fix the nit comments about vllm.envs, and change the signature of collective_rpc to collective_rpc(self, method, timeout, *args, **kwargs). Thanks.
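
For reference, a sketch of the requested signature shape (interface only; the real method dispatches the call to the worker processes and collects their results):

```python
from typing import Any, Optional

def collective_rpc(self,
                   method: str,
                   timeout: Optional[float] = None,
                   *args: Any,
                   **kwargs: Any) -> list[Any]:
    """Run method(*args, **kwargs) on every worker and return the per-worker
    results, waiting up to timeout seconds if one is given."""
    ...
```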

- update collective_rpc interface
- change run_on_both_engines pytest fixture

Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
@youkaichao
Member

LGTM to merge now.

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) December 10, 2024 04:14
@tlrmchlsmth tlrmchlsmth merged commit 28b3a1c into vllm-project:main Dec 10, 2024
57 checks passed
@youkaichao youkaichao deleted the tms/v1_tp branch December 10, 2024 06:29
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024