[V1] Multiprocessing Tensor Parallel Support for v1 #9856

Merged
merged 68 commits into vllm-project:main from tms/v1_tp on Dec 10, 2024

Conversation

tlrmchlsmth
Collaborator

@tlrmchlsmth tlrmchlsmth commented Oct 30, 2024

Implementation of tensor parallel support for V1.

Some differences from V0:

  • The executor creates N worker processes rather than N-1 processes like in V0
  • All processes run prepare_inputs.
  • All processes run the sampler. This means that all workers need the logits, so the logits_processor performs an AllGather instead of a Gather operation.
  • The executor broadcasts the scheduler output to the workers and one of them sends back the model runner output -- both of these IPCs use shared memory message queues.
  • The workers sit in a very tight model execution loop where they only handle model execution and process termination (a simplified sketch of this broadcast/execute pattern follows this list).
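
For intuition, here is a minimal sketch of that broadcast/execute/collect pattern, using plain multiprocessing queues as a stand-in for vLLM's shared-memory message queues; run_model and the message contents are placeholders rather than the actual vLLM interfaces.

```python
import multiprocessing as mp

def run_model(scheduler_output):
    # Placeholder for the real model forward pass + sampling.
    return {"step": scheduler_output["step"], "sampled_tokens": [0]}

def worker_loop(rank, broadcast_q, output_q):
    # Tight worker loop: receive the broadcast scheduler output, execute the
    # model, and (on rank 0 only) send the model runner output back.
    while True:
        scheduler_output = broadcast_q.get()
        if scheduler_output is None:          # sentinel => terminate the process
            break
        result = run_model(scheduler_output)
        if rank == 0:
            output_q.put(result)

def executor_step(broadcast_qs, output_q, scheduler_output):
    # Executor side: broadcast one step's schedule to all N workers,
    # then wait for the single reply from rank 0.
    for q in broadcast_qs:
        q.put(scheduler_output)
    return output_q.get()

if __name__ == "__main__":
    n = 2
    broadcast_qs = [mp.Queue() for _ in range(n)]
    output_q = mp.Queue()
    workers = [mp.Process(target=worker_loop, args=(r, broadcast_qs[r], output_q))
               for r in range(n)]
    for w in workers:
        w.start()
    print(executor_step(broadcast_qs, output_q, {"step": 0}))
    for q in broadcast_qs:
        q.put(None)                           # shut the workers down
    for w in workers:
        w.join()
```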

Some benchmarks are below, using Python 3.12. I'm running Llama3-8B on 2 A100s, which will really make any overheads stand out.

===================
main
===================

python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 \
   --num-iters-warmup 5 --num-iters 20  --batch-size 8 --input-len 128 --output-len 256

Avg latency: 2.301660185540095 seconds
10% percentile latency: 2.2838743404485284 seconds
25% percentile latency: 2.2875737231224775 seconds
50% percentile latency: 2.291928739286959 seconds
75% percentile latency: 2.2938467713538557 seconds
90% percentile latency: 2.3024177022278307 seconds
99% percentile latency: 2.459673847686499 seconds

===================
This PR
===================

VLLM_USE_V1=1 python3 benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 \
   --num-iters-warmup 5 --num-iters 20  --batch-size 8 --input-len 128 --output-len 256

Avg latency: 2.293809377681464 seconds
10% percentile latency: 2.288416426535696 seconds
25% percentile latency: 2.2926345141604543 seconds
50% percentile latency: 2.293652622960508 seconds
75% percentile latency: 2.295131590683013 seconds
90% percentile latency: 2.299927053321153 seconds
99% percentile latency: 2.300706419609487 seconds


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@njhill njhill self-requested a review October 31, 2024 00:15
@WoosukKwon
Collaborator

@tlrmchlsmth Thanks for the great work! Let me know when it is ready for review.


mergify bot commented Nov 1, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @tlrmchlsmth please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@njhill
Member

njhill commented Nov 11, 2024

@tlrmchlsmth this looks like a great start and is in the right direction imo. I have been thinking a lot though about how to streamline some things in V1... going to dump some of those ideas here and we can discuss more next week.

  • I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.
  • We currently have (imo) quite a convoluted layering w.r.t. distributed communication. Initially the idea was to separate control and data planes, but for performance we ended up with a situation where the control plane is an event loop that kicks the workers into a separate loop when executing requests. In both cases they are waiting on a queue for broadcast messages, and I feel it would be simpler to have a single event loop.
  • After broadcasting, our core engine loop can perform other CPU tasks while waiting for the response from the rank 0 worker - for example serialization/deserialization of messages to/from the "front-end" process.
  • In fact it should be fairly simple (unless I'm missing something) to decouple the data flows to/from the workers and make the scheduling async, so that they always have the schedule for the next step waiting when they finish the current one. We can update and broadcast the schedule with the latest added/removed requests for the (n+2)th step soon after receiving the sampler output for the nth step (i.e. while the (n+1)th step is underway).
  • We should use msgpack/msgspec instead of pickle for all these IPC messages (see the sketch after this list).
  • I think the current MessageQueue impl (using shm and zmq) would need some adjustments to support the above, such as not using pickle. Also I don't think we would want to use shm for receiving messages in the central proc since it involves spinning. We could use it in the workers to receive the broadcasts but if we are doing async scheduling then it might actually be better to use zmq for a similar reason - and then it will be easier to overlap the I/O and serialization/deserialization with the forward pass by running it in a separate thread.
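
To make the msgpack/msgspec point concrete, here is a minimal sketch of what a pickle-free IPC message could look like; the SchedulerOutput fields below are made up for illustration and are not vLLM's actual schema.

```python
import msgspec

class SchedulerOutput(msgspec.Struct):
    # Hypothetical fields, for illustration only.
    step: int
    scheduled_request_ids: list[str]
    finished_request_ids: list[str]

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(SchedulerOutput)

msg = SchedulerOutput(step=42,
                      scheduled_request_ids=["req-1", "req-2"],
                      finished_request_ids=[])
wire_bytes = encoder.encode(msg)           # compact bytes, suitable for zmq/shm transport
roundtripped = decoder.decode(wire_bytes)  # typed decode, no pickle involved
assert roundtripped == msg
```

Unlike pickle, the decoder only constructs the declared Struct type, which avoids arbitrary code execution on deserialization and much of pickle's overhead.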

Mostly orthogonal to the above, I also feel we should rethink the executor abstraction / class hierarchy to better isolate the accelerator-specific aspects. There's a significant amount of duplicated logic right now across different executors (and even workers) and I think we could unify/consolidate a lot of that.


mergify bot commented Nov 11, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2024
@youkaichao youkaichao self-assigned this Nov 12, 2024
@mergify mergify bot removed the needs-rebase label Nov 13, 2024
@njhill njhill mentioned this pull request Nov 14, 2024
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
@tlrmchlsmth
Collaborator Author

> I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.

Do folks think I should go ahead with this in the current PR? I know @youkaichao had some concerns about ease of debugging if we always put the worker in a separate process.

Not sure I see a benefit for this implementation, beyond consolidating the implementation and reducing code size. For asynchronous scheduling, it makes a whole lot of sense.

@njhill
Member

njhill commented Nov 22, 2024

>> I think we can better unify the single vs multi-GPU logic/abstractions. And actually I think @WoosukKwon's hope is to keep the worker in a separate process even for single GPU, so that we always have n>=1 worker processes.

> Do folks think I should go ahead with this in the current PR? I know @youkaichao had some concerns about ease of debugging if we always put the worker in a separate process.

> Not sure I see a benefit for this implementation, beyond consolidating the implementation and reducing code size. For asynchronous scheduling, it makes a whole lot of sense.

@tlrmchlsmth I think it's fine to defer it.

Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>

mergify bot commented Nov 23, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 23, 2024
Signed-off-by: Tyler Michael Smith <[email protected]>
@tlrmchlsmth
Collaborator Author

Switched initialization over to use collective_rpc. Easy now that we've made it general.
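
As a rough illustration of what initialization through a generic collective_rpc can look like (a toy sketch, not the actual vLLM executor; the worker method names are placeholders):

```python
class ToyExecutor:
    # Toy stand-in: real workers live in separate processes and the RPC goes
    # over IPC; here the workers are in-process objects to keep this short.
    def __init__(self, workers):
        self.workers = workers

    def collective_rpc(self, method, *args, **kwargs):
        # Run the named method on every worker and gather the results.
        return [getattr(worker, method)(*args, **kwargs) for worker in self.workers]

    def initialize(self, num_gpu_blocks):
        # Each setup phase becomes one generic RPC instead of a bespoke call path.
        self.collective_rpc("init_device")
        self.collective_rpc("load_model")
        self.collective_rpc("initialize_cache", num_gpu_blocks)
```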

Review thread on vllm/platforms/cuda.py (outdated, resolved)
Member

@youkaichao youkaichao left a comment


Thanks for the great efforts! The PR looks good to me in general. I left several new comments, but they should be easy to address.

I do want to discuss one more thing about the initialization process tomorrow.

Review thread on vllm/platforms/cuda.py (outdated, resolved)
Member

@youkaichao youkaichao left a comment


Thanks for the great work! I don't have any major concerns now. There is some follow-up work, but we don't necessarily need to do it in this PR.

Before merging, please fix the nit comments about vllm.envs, and change the signature of collective_rpc to collective_rpc(self, method, timeout, *args, **kwargs). Thanks.
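
For reference, a sketch of the requested signature shape (interface only; the real method dispatches the call to the worker processes and collects their results):

```python
from typing import Any, Optional

def collective_rpc(self,
                   method: str,
                   timeout: Optional[float] = None,
                   *args: Any,
                   **kwargs: Any) -> list[Any]:
    """Run method(*args, **kwargs) on every worker and return the per-worker
    results, waiting up to timeout seconds if one is given."""
    ...
```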

- update collective_rpc interface
- change run_on_both_engines pytest fixture

Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
@youkaichao
Member

LGTM to merge now.

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) December 10, 2024 04:14
@tlrmchlsmth tlrmchlsmth merged commit 28b3a1c into vllm-project:main Dec 10, 2024
57 checks passed
@youkaichao youkaichao deleted the tms/v1_tp branch December 10, 2024 06:29
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024