SGLang doc user flow updates #703

Merged
merged 19 commits into from
Dec 23, 2024
Changes from 2 commits
42 changes: 42 additions & 0 deletions docs/shortfin/llm/user/e2e_llama8b_k8s.md
@@ -0,0 +1,42 @@
# Llama 8b GPU instructions on Kubernetes
Member


I'd also keep this guide general, maybe keep it next to llama_end_to_end.md as llama_serving_on_kubernetes.md, dropping "8B" and "GPU" from the title. Could then also rename llama_end_to_end.md as llama_serving.md? IDK. Naming is hard.

I'm being picky about file names since I want to link to these guides in the release notes, which will then make renaming them later harder without creating 404s

Contributor


Cool, I think we should go with llama_serving_on_kubernetes.md and llama_serving.md. end to end can be confusing to what it entails (especially with the sglang layer on top)


## Setup

This guide uses `llama_8b_f16` as an example to describe the process of
exporting a model and deploying four instances of a shortfin LLM server
behind a load balancer on MI300X GPUs.

### Pre-Requisites

- A Kubernetes cluster available to use
- `kubectl` installed on the system and configured for the cluster of interest
    - To install `kubectl`, see [kubectl install](https://kubernetes.io/docs/tasks/tools/#kubectl)
    and set the `KUBECONFIG` environment variable to point to your kubeconfig file to authorize
    the connection to the cluster.
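
Before deploying, it can help to confirm that the kubeconfig file mentioned above actually exists. The following is a minimal sketch, not part of the shortfin tooling: the helper names `kubeconfig_paths` and `missing_kubeconfigs` are hypothetical, and the sketch relies only on the standard `kubectl` behavior of reading `KUBECONFIG` (which may list several paths) and falling back to `~/.kube/config`.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def kubeconfig_paths(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """List the kubeconfig files kubectl would consult, in order."""
    env = os.environ if env is None else env
    raw = env.get("KUBECONFIG", "")
    # KUBECONFIG may hold several paths joined by the OS path separator.
    paths = [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
    # With the variable unset, kubectl falls back to ~/.kube/config.
    return paths or [Path.home() / ".kube" / "config"]


def missing_kubeconfigs(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Return the configured kubeconfig paths that do not exist on disk."""
    return [p for p in kubeconfig_paths(env) if not p.is_file()]


# Report any configured kubeconfig file that is missing on this machine.
for path in missing_kubeconfigs():
    print(f"kubeconfig not found: {path}")
```

If nothing is printed, every configured kubeconfig path exists; whether its credentials are valid for the cluster still has to be checked with `kubectl` itself.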

### Deploy shortfin llama app service

Edit the [deployment file](https://github.com/nod-ai/shark-ai/tree/main/shortfin/python/shortfin_apps/llm/k8s/llama-app-deployment.yaml) so that it fetches the correct artifacts and serves the intended configuration of the llama3 model for your use case.

To deploy llama app:

```
kubectl apply -f llama-app-deployment.yaml
```

To retrieve the external IP for targeting the llama app load balancer:

```
kubectl get service shark-llama-app-service
```
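
If the external IP is needed in a script rather than read off the table, one option is to ask `kubectl` for JSON (`kubectl get service shark-llama-app-service -o json`) and read the load balancer address from the Service status. The sketch below only parses such output; the function name `external_ip_from_service` and the sample address are hypothetical.

```python
import json
from typing import Optional


def external_ip_from_service(service_json: str) -> Optional[str]:
    """Return the first load-balancer address found in a Service's JSON."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        # A load balancer is reported with either an IP or a hostname.
        address = entry.get("ip") or entry.get("hostname")
        if address:
            return address
    # No address yet: the load balancer is most likely still pending.
    return None


# Hypothetical sample, trimmed to the fields used above.
sample = json.dumps(
    {"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}}
)
print(external_ip_from_service(sample))
```

A `None` result usually means the load balancer has not been assigned an address yet, so retrying after a short wait is reasonable.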

Now you can use the external IP for sglang integration or for sending text generation requests directly.
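
As a rough illustration of sending a request to the load balancer, the sketch below builds an HTTP request with the Python standard library. Several details are assumptions that should be checked against the shortfin server documentation: the `/generate` path, the payload fields `text` and `sampling_params.max_completion_tokens`, port 80, and the helper name `build_generate_request`. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_generate_request(
    external_ip: str, prompt: str, max_tokens: int = 50, port: int = 80
) -> urllib.request.Request:
    """Build (but do not send) a generation request for the load balancer."""
    # ASSUMPTION: the server accepts POST /generate with this JSON shape.
    payload = {
        "text": prompt,
        "sampling_params": {"max_completion_tokens": max_tokens},
    }
    return urllib.request.Request(
        url=f"http://{external_ip}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 203.0.113.10 is a placeholder; use the external IP of your service.
request = build_generate_request("203.0.113.10", "Name the capital city of the USA.")
print(request.get_method(), request.full_url)

# To actually send it against a running deployment:
# body = urllib.request.urlopen(request, timeout=60).read().decode("utf-8")
```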

### Delete shortfin llama app service

When you are done, make sure to delete the deployment and the service:

```
kubectl delete deployment shark-llama-app-deployment
kubectl delete service shark-llama-app-service
```
224 changes: 96 additions & 128 deletions docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
@@ -24,21 +24,12 @@ For this tutorial, you will need to meet the following prerequisites:
- You can check out [pyenv](https://github.com/pyenv/pyenv)
as a good tool to be able to manage multiple versions of python
on the same system.
- A running `shortfin` LLM server as described [below](#installstart-shortfin-llm-server)
- A running `shortfin` LLM server. To launch the server on a single system, follow [this guide](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md);
to launch it on a Kubernetes cluster, follow [this guide](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_k8s.md).
- We will use the shortfin server as the `backend` to generate completions
from SGLang's `frontend language`. In this tutorial, you can think of
`sglang` as the client and `shortfin` as the server.

### Hardware

- This tutorial is designed to run on an [AMD MI300X GPU](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)

## Install/Start `shortfin` LLM server

Follow the steps [here](https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md)
to export a model with `sharktank` and start a `shortfin` LLM server
with that model.

## Install sglang

### Install sglang inside of virtual environment
@@ -48,6 +39,8 @@ We can use pip to install it in the same virtual environment that we used
to start our Shortfin LLM Server.

```bash
python -m venv --prompt shark-ai .venv
source .venv/bin/activate
pip install "git+https://github.com/nod-ai/sglang.git#subdirectory=python"
```

@@ -56,8 +49,9 @@ pip install "git+https://github.com/nod-ai/sglang.git#subdirectory=python"
You can verify the installation/setup through the following examples:

- [Multi-Turn Q&A Example](#multi-turn-qa-example)
- [Streaming Example](#streaming-example)
- [Fork Example](#fork-example)
- [Benchmark Shortfin](#bench-mark-shortfin-w-sglang-bench_serving-script)
- [Multi-Turn Q&A Batch Example](#multi-turn-qa-batch-example)

## Multi-Turn Q&A example

@@ -79,57 +73,73 @@ import sglang as sgl

from sglang.lang.chat_template import get_chat_template

backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://localhost:8000", ) # Change base_url if running at different address
backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://10.158.231.134:80", ) # Change base_url if running at different address

sgl.set_default_backend(backend)

@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=256))
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=50))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=256))
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=50))

state = multi_turn_question.run(question_1="Name the capital city of the USA.", question_2="The Smithsonian is in this location.")

for m in state.messages():
    print(m["role"], m["content"])
```
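
Because the server address differs between a local run and a Kubernetes deployment, one way to avoid editing `base_url` in every example is to read it from the environment. This is only a sketch: the variable name `SHORTFIN_BASE_URL` and the helper `resolve_base_url` are hypothetical, and the local default mirrors the `http://localhost:8000` address used for a single-system server.

```python
import os
from typing import Mapping, Optional

# Default address of a locally running server.
DEFAULT_BASE_URL = "http://localhost:8000"


def resolve_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the server address from the environment, with a local default."""
    env = os.environ if env is None else env
    # SHORTFIN_BASE_URL is a hypothetical variable name chosen for this sketch.
    url = env.get("SHORTFIN_BASE_URL", DEFAULT_BASE_URL).strip()
    if not url.startswith(("http://", "https://")):
        # Allow a bare "host:port" value such as a load balancer IP.
        url = "http://" + url
    return url.rstrip("/")


print(resolve_base_url({}))
print(resolve_base_url({"SHORTFIN_BASE_URL": "10.158.231.134:80"}))

# Usage with the backend from the example above:
# backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"),
#                        base_url=resolve_base_url())
```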

### Shortfin example output
## Streaming Example

You should see an output similar to this:
We can stream our request for a more responsive feel. Let's invoke a `streaming` Q&A from our server:

```text
========== single ==========
```python
import sglang as sgl
from sglang.lang.chat_template import get_chat_template

user : Name the capital city of the USA
assistant : The capital city of the United States of America is Washington, D.C. (short for District of Columbia).
user : The Smithsonian is in this location.
assistant : The Smithsonian Institution is indeed located in Washington, D.C. and is one of the world's largest and most comprehensive museums and research complexes.
```
backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://10.158.231.134:80") # Change base_url if running at a different address

## Fork example
sgl.set_default_backend(backend)

Now that we have sglang installed, we can run an example to show a `fork`
flow with the SGLang [Frontend Language](https://sgl-project.github.io/frontend/frontend.html):
@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=50))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=50))

### Open python interpreter
question_1 = "Name the capital city of the USA."
question_2 = "The Smithsonian is in this location."

```bash
python
# Run the multi-turn question function with streaming enabled
state = multi_turn_question.run(
    question_1=question_1,
    question_2=question_2,
    stream=True,
)

# Collect messages from the streamed output
messages = ""

for chunk in state.text_iter():
    messages += chunk

print(messages)
```
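
The streaming example collects every chunk before printing, so the text still appears all at once. To get the more responsive feel, each chunk can be written as soon as it arrives. The sketch below shows that pattern with a hypothetical helper, `print_stream`, and a stand-in list of chunks in place of `state.text_iter()`.

```python
import sys
from typing import Iterable, TextIO


def print_stream(chunks: Iterable[str], out: TextIO = sys.stdout) -> str:
    """Write each chunk as soon as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # Show partial output immediately instead of buffering.
        parts.append(chunk)
    out.write("\n")
    return "".join(parts)


# Stand-in chunks; with a running server, pass state.text_iter() instead.
full_text = print_stream(iter(["Washington", ", ", "D.C."]))
```

Returning the joined text keeps the complete response available for later use, as the `messages` variable does in the example above.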

### Run example

You can copy and paste the following example into your interpreter:
## Fork example

We can also send different pieces of the same prompt in parallel using the `fork`
flow with the SGLang [Frontend Language](https://sgl-project.github.io/frontend/frontend.html):

```python
import sglang as sgl

from sglang.lang.chat_template import get_chat_template

backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://localhost:8000") # Change base_url if running at different address
backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://10.158.231.134:80") # Change base_url if running at different address

sgl.set_default_backend(backend)

@@ -142,7 +152,7 @@ def tip_suggestion(s):
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
        f += sgl.gen(f"detailed_tip", max_tokens=50, stop="\n\n")
    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")
@@ -152,103 +162,61 @@ state = tip_suggestion.run()
print(state.text())
```

### Shortfin example output

You should see an output similar to this:

```text
Here are two tips for staying healthy: 1. Balanced Diet. 2. Regular Exercise.

Tip 1:A balanced diet is important for maintaining good health. It should
include a variety of foods from all the major food groups, such as fruits,
vegetables, grains, proteins, and dairy. Eating a balanced diet can help
prevent chronic diseases such as heart disease, diabetes, and obesity.

Now, expand tip 2 into a paragraph:
Regular exercise is also important for maintaining good health. It can help
improve cardiovascular health, strengthen muscles and bones, and reduce the
risk of chronic diseases. Exercise can also help improve mental health by
reducing stress and anxiety. It is recommended that adults get at least 150
minutes of moderate-intensity exercise or 75 minutes of vigorous-intensity
exercise per week.

Now, combine the two paragraphs into a single paragraph:
A balanced diet and regular exercise are both important for maintaining good
health. A balanced diet should include a variety of foods from all the major
food groups, such as fruits, vegetables, grains, proteins, and dairy.
Eating a balanced diet can help prevent chronic diseases such as heart disease,
diabetes, and obesity. Regular exercise is also important for maintaining good
health. It can help improve cardiovascular health, strengthen muscles and bones,
and reduce the risk of chronic diseases. Exercise can also help improve mental
health by reducing stress and anxiety. It is recommended that

Tip 2:Regular exercise is important for maintaining a healthy body and mind.
It can help improve cardiovascular health, strengthen muscles and bones,
and reduce the risk of chronic diseases such as diabetes and heart disease.
Additionally, exercise has been shown to improve mood, reduce stress,
and increase overall well-being. It is recommended that adults engage in
at least 150 minutes of moderate-intensity aerobic activity or 75 minutes of
vigorous-intensity aerobic activity per week, as well as strength training
exercises at least two days per week.

In summary, a balanced diet and regular exercise are both essential for
maintaining good health. A balanced diet should include a variety of foods from
all the major food groups, while regular exercise can help improve
cardiovascular health, strengthen muscles and bones, reduce the risk of
chronic diseases, and improve mental health. It is recommended that adults
engage in at least 150 minutes of moderate-intensity aerobic activity or
75 minutes of vigorous-intensity aerobic activity per week,
as well as strength training exercises at least two days per week.
```
## Multi-Turn Q&A Batch Example

With **Shortfin** + SGLang, we can also easily send requests as a batch.
Let's now invoke a `batched` Q&A flow using SGLang's [batching](https://sgl-project.github.io/frontend/frontend.html#batching) support:

```python
import sglang as sgl
from sglang.lang.chat_template import get_chat_template

## Benchmark shortfin w/ sglang `bench_serving` script
# Initialize the backend with the specified chat template and base URL
backend = sgl.Shortfin(chat_template=get_chat_template("llama-3-instruct"), base_url="http://10.158.231.134:80") # Change base_url if running at a different address

We can obtain benchmarking metrics using the `bench_serving` script
provided by SGLang:
# Set the default backend for sglang
sgl.set_default_backend(backend)

**NOTE: Change `--base-url` if running at a different address**
@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=50))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=50))

# Define the questions for the first and second sets
question_1_1 = "Name the capital city of the USA."
question_1_2 = "The Smithsonian is in this location."
question_2_1 = "Name the largest city in the USA."
question_2_2 = "The Empire State Building is in this location."

# Run the multi-turn question function in batch mode
states = multi_turn_question.run_batch(
    [
        {
            "question_1": question_1_1,
            "question_2": question_1_2,
        },
        {
            "question_1": question_2_1,
            "question_2": question_2_2,
        },
    ]
)

# Extract responses from the states
first_qa = states[0]
second_qa = states[1]

first_qa_messages = first_qa.messages()
second_qa_messages = second_qa.messages()

# Print messages from the first QA session
for m in first_qa_messages:
    print(m["role"], m["content"])

```bash
python -m sglang.bench_serving --backend shortfin --num-prompt 10 --base-url http://localhost:8000 --tokenizer /path/to/tokenizer/dir --request-rate 1
```
# Print messages from the second QA session
for m in second_qa_messages:
    print(m["role"], m["content"])

There are some more metrics captured, but the most relevant are the following:

- E2E Latency
- TTFT (Time to First Token)
- TPOT (Time per Output Token)
- ITL (Inter-Token Latency)
- Request Throughput
- Benchmark Duration

When complete, you should see an output similar to this:

```text
============ Serving Benchmark Result ============
Backend: shortfin
Traffic request rate: 1.0
Successful requests: 10
Benchmark duration (s): 427.91
Total input tokens: 1960
Total generated tokens: 2774
Total generated tokens (retokenized): 63
Request throughput (req/s): 0.02
Input token throughput (tok/s): 4.58
Output token throughput (tok/s): 6.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 416268.77
Median E2E Latency (ms): 417159.14
---------------Time to First Token----------------
Mean TTFT (ms): 292404.29
Median TTFT (ms): 365989.01
P99 TTFT (ms): 367325.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1359.41
Median TPOT (ms): 163.96
P99 TPOT (ms): 6316.12
---------------Inter-token Latency----------------
Mean ITL (ms): 2238.99
Median ITL (ms): 958.75
P99 ITL (ms): 2719.50
==================================================
```
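
In the batched Q&A example, `run_batch` receives a list of dictionaries whose keys match the parameters of `multi_turn_question`. When there are more than a couple of question pairs, that list can be generated instead of written out by hand. A minimal sketch, with the helper name `build_batch_arguments` being hypothetical:

```python
from typing import Dict, Iterable, List, Tuple


def build_batch_arguments(
    question_pairs: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Turn (first, second) question pairs into run_batch keyword dicts."""
    return [
        {"question_1": first, "question_2": second}
        for first, second in question_pairs
    ]


pairs = [
    ("Name the capital city of the USA.", "The Smithsonian is in this location."),
    ("Name the largest city in the USA.", "The Empire State Building is in this location."),
]
batch_arguments = build_batch_arguments(pairs)
print(len(batch_arguments))

# With a running server and the function from the batched example:
# states = multi_turn_question.run_batch(batch_arguments)
```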