-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SGLang doc user flow updates #703
Conversation
stbaione
commented
Dec 16, 2024
•
edited by saienduri
Loading
edited by saienduri
- Expand examples in user docs to cover all supported features
- Default docs to targeting cluster application
- General cleanup and removal of unnecessary text
- Add k8s instructions for shortfin deployment
Expand examples for targeting shortfin from sglang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we should have steps for generating artifacts before launching shortfin/sglang and cluster: steps for downloading gguf or safetensor, ingesting it using sharktank and exporting mlir and compiling vmfb and then launch shortfin with artifacts -- in that sequence
shortfin/python/shortfin_apps/llm/k8s/llama-app-deployment.yaml
Outdated
Show resolved
Hide resolved
@kumardeepakamd Should we have instructions here, or link to the shortfin user docs? If we link to shortfin user docs, it'll stay consistent as we make updates over there. Otherwise we'll have to reflect updates in both places |
Whatever is easier to maintain is fine. Your call.
…On Wed, Dec 18, 2024, 3:01 PM Stephen Baione ***@***.***> wrote:
So, we should have steps for generating artifacts before launching
shortfin/sglang and cluster: steps for downloading gguf or safetensor,
ingesting it using sharktank and exporting mlir and compiling vmfb and then
launch shortfin with artifacts -- in that sequence
@kumardeepakamd <https://github.com/kumardeepakamd> Should we have
instructions here, or link to the shortfin user docs? If we link to
shortfin user docs, it'll stay consistent as we make updates over there.
Otherwise we'll have to reflect updates in both places
—
Reply to this email directly, view it on GitHub
<#703 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/A5OMX35FNJ4VLNR3NUFDYUD2GH5DFAVCNFSM6AAAAABTWYWVMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJSGQZDQNBVGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Add iree-base-runtime, iree-base-compiler and iree-turbine to nightly install instructions
Made links to existing docs for shortfin more visible, updated install instructions on Shortfin 8b user docs |
docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few nits to clean up the example code samples.
docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
Outdated
Show resolved
Hide resolved
docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
Outdated
Show resolved
Hide resolved
docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
Outdated
Show resolved
Hide resolved
shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml
Outdated
Show resolved
Hide resolved
# update to your artifacts and change cli flags for instantiation of server to match your intended llama configuration | ||
args: | ||
- | | ||
sudo apt update && | ||
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash && | ||
sudo apt install git -y && | ||
sudo apt install python3.11 python3.11-dev python3.11-venv -y && | ||
sudo apt-get install wget -y && | ||
python3.11 -m venv shark_venv && source shark_venv/bin/activate && | ||
mkdir shark_artifacts && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/config.json -O shark_artifacts/config.json && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/meta-llama-3.1-8b-instruct.f16.gguf -O shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json && | ||
pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels && | ||
pip install pandas && | ||
python -m shortfin_apps.llm.server --tokenizer_json=shark_artifacts/tokenizer.json --model_config=shark_artifacts/config.json --vmfb=shark_artifacts/model.vmfb --parameters=shark_artifacts/meta-llama-3.1-8b-instruct.f16.gguf --device_ids 0 --device=hip; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These files should come from huggingface and we shouldn't use a precompiled .vmfb file, we should use sharktank + iree-compile.
Maybe file an issue for now to generalize this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here, I don't think we want to put the whole flow as part of the kubernetes instantiation. It would require each instance to pull down a 100+ GB gguf and go through the whole flow which isn't really ideal at scale. Instead, we expect the users to follow the llama_serving.md
and generate artifacts that they push somewhere (NFS, S3, CSP) and pull down on each instantiation. I can make this more clearer with the docs, but why I put configure with your own artifacts in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. We can provide an example, but if we expect people to run this then it should be really clear what they are expected to fork/edit and what is shared.
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/model.vmfb -O shark_artifacts/model.vmfb && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer_config.json -O shark_artifacts/tokenizer_config.json && | ||
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_8b/tokenizer.json -O shark_artifacts/tokenizer.json && | ||
pip install --pre shortfin[apps] -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should point users at stable releases. Could have the file using nightly releases until we push 3.1.0 though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I can update when we push 3.1.0
shortfin/deployment/shortfin_apps/llm/k8s/llama-app-deployment.yaml
Outdated
Show resolved
Hide resolved
Co-authored-by: Scott Todd <[email protected]>
…s BASE_URL in code examples
docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,42 @@ | |||
# LLama 8b GPU instructions on Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also keep this guide general, maybe keep it next to llama_end_to_end.md
as llama_serving_on_kubernetes.md
, dropping "8B" and "GPU" from the title. Could then also rename llama_end_to_end.md
as llama_serving.md
? IDK. Naming is hard.
I'm being picky about file names since I want to link to these guides in the release notes, which will then make renaming them later harder without creating 404s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I think we should go with llama_serving_on_kubernetes.md
and llama_serving.md
. end to end can be confusing to what it entails (especially with the sglang layer on top)
Co-authored-by: Scott Todd <[email protected]>
Co-authored-by: Scott Todd <[email protected]>
…ang`, Added `Server Options` to `llama_serving.md`, detailing server flags
- Expand examples in user docs to cover all supported features - Default docs to targeting cluster application - General cleanup and removal of unnecessary text - Add k8s instructions for shortfin deployment --------- Co-authored-by: saienduri <[email protected]> Co-authored-by: saienduri <[email protected]> Co-authored-by: Scott Todd <[email protected]>