Merge branch 'main' into bump-iree-3.1.0rc20241220
ScottTodd authored Jan 2, 2025
2 parents 8f49804 + 56f3d21 commit ff20f15
Showing 22 changed files with 1,138 additions and 291 deletions.
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -7,6 +7,7 @@ repos:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
        args: ['--allow-multiple-documents']
      - id: check-added-large-files
  - repo: https://github.com/psf/black
    rev: 22.10.0
@@ -272,3 +272,32 @@ If you want to find the process again:
```bash
ps -f | grep shortfin
```

## Server Options

The server accepts a number of command-line flags.
To list them, run:

```bash
python -m shortfin_apps.llm.server --help
```

### Available Flags

A full list of options can be found below:

| Argument | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--host HOST` | Specify the host to bind the server. |
| `--port PORT` | Specify the port to bind the server. |
| `--root-path ROOT_PATH` | Root path to use for installing behind a path-based proxy. |
| `--timeout-keep-alive TIMEOUT_KEEP_ALIVE` | Keep-alive timeout duration. |
| `--tokenizer_json TOKENIZER_JSON` | Path to a `tokenizer.json` file. |
| `--tokenizer_config_json TOKENIZER_CONFIG_JSON` | Path to a `tokenizer_config.json` file. |
| `--model_config MODEL_CONFIG` | Path to the model config file. |
| `--vmfb VMFB` | Model [VMFB](https://iree.dev/developers/general/developer-tips/#inspecting-vmfb-files) to load. |
| `--parameters [FILE ...]` | Parameter archives to load (supports: `gguf`, `irpa`, `safetensors`). |
| `--device {local-task,hip,amdgpu}` | Device to serve on (e.g., `local-task`, `hip`). Same options as [iree-run-module --list_drivers](https://iree.dev/guides/deployment-configurations/gpu-rocm/#get-the-iree-runtime). |
| `--device_ids [DEVICE_IDS ...]` | Device IDs visible to the system builder. Defaults to None (full visibility). Can be an index or a device ID like `amdgpu:0:0@0`. |
| `--isolation {none,per_fiber,per_call}` | Concurrency control: How to isolate programs. |
| `--amdgpu_async_allocations` | Enable asynchronous allocations for AMD GPU device contexts. |
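
As a rough illustration of how these flags combine, the sketch below defines a small helper that prints a candidate command line instead of running it. The helper name `shortfin_server_cmd`, the fallback port `8000`, and the file names under the export directory are assumptions made for this example; substitute the artifacts and values from your own setup.

```bash
# Hypothetical helper: print (do not run) a server command line.
#   $1: directory holding the exported artifacts (file names below are placeholders)
#   $2: port to bind; 8000 is an arbitrary fallback chosen for this sketch
shortfin_server_cmd() {
  echo python -m shortfin_apps.llm.server \
    --host 0.0.0.0 \
    --port "${2:-8000}" \
    --tokenizer_json "$1/tokenizer.json" \
    --model_config "$1/config.json" \
    --vmfb "$1/model.vmfb" \
    --parameters "$1/model.irpa" \
    --device hip \
    --device_ids 0
}
```

Calling `shortfin_server_cmd /path/to/export 8080` prints the full command so it can be reviewed, then copied and run once the paths look right.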
44 changes: 44 additions & 0 deletions docs/shortfin/llm/user/llama_serving_on_kubernetes.md
@@ -0,0 +1,44 @@
# Llama 8b GPU instructions on Kubernetes

## Setup

This guide uses `llama_8b_f16` as an example to walk through
exporting a model and deploying four instances of the shortfin LLM server
behind a load balancer on MI300X GPUs.

### Pre-Requisites

- A Kubernetes cluster that you can deploy to
- `kubectl` installed and configured for that cluster
  - To install `kubectl`, see [kubectl install](https://kubernetes.io/docs/tasks/tools/#kubectl)
    and set the `KUBECONFIG` environment variable to point to your kubeconfig file so that `kubectl` can
    authenticate to the cluster.

### Deploy shortfin llama app service

To generate the artifacts required for this Kubernetes deployment, follow [llama_serving.md](./llama_serving.md) until you have all of the files needed to run the shortfin LLM server.
Upload the artifacts to storage that your cluster can pull from (NFS, S3, CSP).
Save [llama-app-deployment.yaml](../../../../shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml) locally, then edit it to point at the artifacts you just uploaded and to set the server flags for your intended configuration.

To deploy the llama app:

```bash
kubectl apply -f llama-app-deployment.yaml
```
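
After applying the manifest, it can help to wait until the rollout has finished before sending traffic. The sketch below wraps the apply step and a `kubectl rollout status` check in two helpers. The helper names `run` and `deploy_llama_app`, the `DRY_RUN` variable, and the ten-minute timeout are assumptions for illustration; the deployment name is the one used in the delete step of this guide.

```bash
# Hypothetical helper: run a command, or only print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Apply the manifest, then block until the deployment's pods are rolled out.
#   $1: path to the manifest (defaults to the file name used in this guide)
deploy_llama_app() {
  run kubectl apply -f "${1:-llama-app-deployment.yaml}"
  # The 10m timeout is an arbitrary choice; large models may need longer.
  run kubectl rollout status deployment/shark-llama-app-deployment --timeout=10m
}
```

Setting `DRY_RUN=1` before calling `deploy_llama_app` prints the two `kubectl` commands without touching the cluster, which is a cheap way to check the manifest path first.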

To retrieve the external IP for targeting the llama app load balancer:

```bash
kubectl get service shark-llama-app-service
```
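
For scripting, the address can be pulled out of the service status instead of being read from the table output. This is a minimal sketch: the helper names `external_ip` and `service_url` are made up for this example, and the port you pass must match whatever your edited `llama-app-deployment.yaml` exposes. Some providers report a hostname rather than an IP for load balancers, in which case the `.ip` field will be empty.

```bash
# Hypothetical helper: print the load balancer IP of a service.
# Prints nothing while the external IP is still pending.
external_ip() {
  kubectl get service "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Hypothetical helper: build a base URL from an address and a port.
service_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}
```

For example, `service_url "$(external_ip shark-llama-app-service)" 80` would print the base URL to target, assuming the service listens on port 80.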

You can now use the external IP for SGLang integration or to send text generation requests directly.

### Delete shortfin llama app service

When you are done, delete the deployment and the service:

```bash
kubectl delete deployment shark-llama-app-deployment
kubectl delete service shark-llama-app-service
```