adds llamacpp benchmarking support (#263)
1 parent ed30c98 · commit 4e7450d
Showing 5 changed files with 678 additions and 99 deletions.

# LLAMA.CPP

Run transformer models using llama.cpp. This integration allows you to:
1. Load and run llama.cpp models
2. Benchmark model performance
3. Use the models with other tools such as chat or MMLU accuracy testing

## Prerequisites

You need:
1. A compiled llama.cpp executable (llama-cli or llama-cli.exe)
2. A GGUF model file (one way to download one is shown below)
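
If you do not already have a checkpoint, one option is to download a GGUF file directly from Hugging Face. The model below is just an illustration; any GGUF checkpoint will work:

```bash
# Download an example GGUF checkpoint into the current directory
wget https://huggingface.co/TheBloke/Dolphin-Llama2-7B-GGUF/resolve/main/dolphin-llama2-7b.Q5_K_M.gguf
```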

### Building llama.cpp (if needed)

#### Linux
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

#### Windows
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the repository root on Linux.
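
A quick sanity check is to ask the binary for its help text (the path below assumes the Linux build; use the Windows path above otherwise):

```bash
# Confirm the llama-cli binary runs and prints its usage information
./llama-cli --help
```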

## Usage

### Loading a Model

Use the `load-llama-cpp` tool to load a model:

```bash
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE
```

The `load-llama-cpp` tool supports the following parameters:

| Parameter     | Required | Default | Description                           |
|---------------|----------|---------|---------------------------------------|
| executable    | Yes      | -       | Path to llama-cli/llama-cli.exe       |
| model-binary  | Yes      | -       | Path to the .gguf model file          |
| threads       | No       | 1       | Number of threads for generation      |
| context-size  | No       | 512     | Context window size                   |
| output-tokens | No       | 512     | Maximum number of tokens to generate  |
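
The optional parameters are passed as flags in the same way as `--executable` and `--model-binary`. A minimal sketch, assuming the flag names match the table above:

```bash
# Hypothetical example: load with more threads and a larger context window
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 1024
```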

### Benchmarking

After loading a model, you can benchmark it using `llama-cpp-bench`:

```bash
lemonade -i MODEL_NAME \
    load-llama-cpp \
        --executable PATH_TO_EXECUTABLE \
        --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench
```

Benchmark parameters:

| Parameter         | Default                      | Description                               |
|-------------------|------------------------------|-------------------------------------------|
| prompt            | "Hello, I am conscious and"  | Input prompt for benchmarking             |
| context-size      | 512                          | Context window size                       |
| output-tokens     | 512                          | Number of tokens to generate              |
| iterations        | 1                            | Number of benchmark iterations            |
| warmup-iterations | 0                            | Number of warmup iterations (not counted) |
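
As a sketch, a benchmark run that overrides the defaults might look like the following (`--iterations` and `--warmup-iterations` appear in the examples below; the `--prompt` and `--output-tokens` flag names are assumed to match the table above):

```bash
# Hypothetical example: benchmark a custom prompt over several iterations
lemonade -i MODEL_NAME \
    load-llama-cpp \
        --executable PATH_TO_EXECUTABLE \
        --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench \
        --prompt "Write a short story about a robot" \
        --output-tokens 256 \
        --iterations 5 \
        --warmup-iterations 1
```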

The benchmark will measure and report:
- Time to first token (prompt evaluation time)
- Token generation speed (tokens per second)

### Example Commands

#### Windows Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
        --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
        --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
    llama-cpp-bench \
        --iterations 3 \
        --warmup-iterations 1

# Run MMLU accuracy test
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
        --executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
        --model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
    accuracy-mmlu \
        --tests management \
        --max-evals 2
```

#### Linux Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
        --executable "./llama-cli" \
        --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llama-cpp-bench \
        --iterations 3 \
        --warmup-iterations 1
```

## Integration with Other Tools

After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface, including:
- accuracy-mmlu
- llm-prompt
- accuracy-humaneval
- and more (an example is sketched below)
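
For instance, a loaded model could be handed to `llm-prompt` for a single generation. This is a sketch only; the `--prompt` flag name for `llm-prompt` is an assumption, not confirmed by this document:

```bash
# Hypothetical example: chain load-llama-cpp with llm-prompt
# (--prompt flag name is assumed for illustration)
lemonade -i MODEL_NAME \
    load-llama-cpp \
        --executable PATH_TO_EXECUTABLE \
        --model-binary PATH_TO_GGUF_FILE \
    llm-prompt \
        --prompt "Hello, my name is"
```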

The integration provides:
- Platform-independent path handling (works on both Windows and Linux)
- Proper error handling with detailed messages
- Performance metrics collection
- Configurable generation parameters such as temperature, top_p, and top_k (a sketch follows below)
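
As a sketch of tuning one of these generation parameters, a sampling temperature might be supplied at load time; the `--temp` flag below follows the naming pattern of the other load parameters but should be treated as an assumption, as should any `top_p`/`top_k` equivalents:

```bash
# Hypothetical example: load with a non-default sampling temperature
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --temp 0.7
```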