SGLang doc user flow updates #703

Merged · 19 commits · Dec 23, 2024
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
   - id: trailing-whitespace
   - id: end-of-file-fixer
   - id: check-yaml
+    args: ['--allow-multiple-documents']
   - id: check-added-large-files
 - repo: https://github.com/psf/black
   rev: 22.10.0
44 changes: 44 additions & 0 deletions docs/shortfin/llm/user/llama_serving_on_kubernetes.md
@@ -0,0 +1,44 @@
# Llama 8b GPU instructions on Kubernetes

## Setup

We will use `llama_8b_f16` as an example to describe the process of exporting a model
and deploying four instances of a shortfin LLM server behind a load balancer on MI300X GPUs.

### Prerequisites

- Kubernetes cluster available to use
- kubectl installed on your system and configured for the cluster of interest
  - To install kubectl, see [kubectl install](https://kubernetes.io/docs/tasks/tools/#kubectl),
    and make sure to set the `KUBECONFIG` environment variable to point to your kube config file
    so that your connection to the cluster is authorized (a quick check is sketched below this list).
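
To confirm that `kubectl` can reach your cluster, something like the following works (the kube config path is an example; substitute your own):

```
# Point kubectl at your cluster's kube config file (path is an example)
export KUBECONFIG=~/.kube/config

# Verify connectivity and confirm the cluster's nodes are visible
kubectl cluster-info
kubectl get nodes
```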

### Deploy shortfin llama app service

To generate the artifacts required for this k8s deployment, follow [llama_serving.md](./llama_serving.md) until you have all of the files needed to run the shortfin LLM server.
Upload your artifacts to a storage option that your k8s cluster can pull from (NFS, S3, or another CSP storage offering).
Save [llama-app-deployment.yaml](../../../../shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml) locally and edit it to reference the artifacts you just stored, changing the flags to your intended configuration.
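
As a rough sketch, uploading to S3 might look like the following (the bucket name and local artifact directory are placeholders, not values from the shortfin docs):

```
# Hypothetical example: push the exported model artifacts to a bucket
# that the pods in your cluster are able to pull from.
aws s3 cp /path/to/exported/artifacts s3://<your-bucket>/llama_8b_f16/ --recursive
```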

To deploy llama app:

```
kubectl apply -f llama-app-deployment.yaml
```
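
To watch the rollout until all replicas are ready (the deployment name matches the one used in the cleanup step below):

```
kubectl rollout status deployment/shark-llama-app-deployment
kubectl get pods
```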

To retrieve the external IP for targeting the llama app load balancer:

```
kubectl get service shark-llama-app-service
```
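
The address appears in the `EXTERNAL-IP` column once the load balancer is provisioned. To pull out just the IP, a jsonpath query can be used:

```
kubectl get service shark-llama-app-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```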

Now you can use the external IP for SGLang integration or for sending text generation requests directly.
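
For example, a generation request can be sent with `curl`. The `/generate` endpoint and request body below follow the shape used elsewhere in the shortfin LLM docs; the port is a placeholder that depends on how the service in `llama-app-deployment.yaml` is configured:

```
curl http://<EXTERNAL-IP>:<PORT>/generate \
  --header "Content-Type: application/json" \
  --data '{
    "text": "Name the capital of the United States.",
    "sampling_params": {"max_completion_tokens": 50}
  }'
```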

### Delete shortfin llama app service

When you are done using the service, make sure to delete the resources:

```
kubectl delete deployment shark-llama-app-deployment
kubectl delete service shark-llama-app-service
```
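
Alternatively, assuming `llama-app-deployment.yaml` defines both the deployment and the service, deleting by manifest removes both in one step:

```
kubectl delete -f llama-app-deployment.yaml
```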