adds llamacpp benchmarking support (#263)
ramkrishna2910 authored Jan 10, 2025
1 parent ed30c98 commit 4e7450d
Showing 5 changed files with 678 additions and 99 deletions.
132 changes: 105 additions & 27 deletions docs/llamacpp.md
# LLAMA.CPP

Run transformer models using llama.cpp. This integration allows you to:
1. Load and run llama.cpp models
2. Benchmark model performance
3. Use the models with other tools like chat or MMLU accuracy testing

## Prerequisites

You need:
1. A compiled llama.cpp executable (llama-cli or llama-cli.exe)
2. A GGUF model file

### Building llama.cpp (if needed)

#### Linux
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

#### Windows
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the root directory on Linux.
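
A quick way to sanity-check the build before involving lemonade is to run the binary directly. As a sketch: the checkpoint URL below comes from an earlier revision of this guide, and `-m`, `-p`, and `-n` are standard llama-cli options.

```bash
# Fetch an example GGUF checkpoint (any GGUF file works here)
wget -P models https://huggingface.co/TheBloke/Dolphin-Llama2-7B-GGUF/resolve/main/dolphin-llama2-7b.Q5_K_M.gguf

# Smoke test: load the model and generate a few tokens
./llama-cli -m models/dolphin-llama2-7b.Q5_K_M.gguf -p "Hello, I am" -n 16
```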

## Usage

### Loading a Model

Use the `load-llama-cpp` tool to load a model:

```bash
lemonade -i MODEL_NAME load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE
```

Parameters:
| Parameter | Required | Default | Description |
|--------------|----------|---------|-------------------------------------------------------|
| executable | Yes | - | Path to llama-cli/llama-cli.exe |
| model-binary | Yes | - | Path to .gguf model file |
| threads | No | 1 | Number of threads for generation |
| context-size | No | 512 | Context window size |
| output-tokens| No | 512 | Maximum number of tokens to generate |
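
For example, the optional parameters can be passed alongside the required ones. A sketch, with flag spellings assumed to mirror the parameter names in the table above:

```bash
# Load with a larger context window and more generation threads
# (flag names assumed to match the parameter names above)
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 2048 \
    --output-tokens 256
```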

### Benchmarking

After loading a model, you can benchmark it using `llama-cpp-bench`:

```bash
lemonade -i MODEL_NAME \
load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE \
llama-cpp-bench
```

Benchmark parameters:
| Parameter | Default | Description |
|------------------|----------------------------|-------------------------------------------|
| prompt | "Hello, I am conscious and"| Input prompt for benchmarking |
| context-size | 512 | Context window size |
| output-tokens | 512 | Number of tokens to generate |
| iterations | 1 | Number of benchmark iterations |
| warmup-iterations| 0 | Number of warmup iterations (not counted) |

The benchmark will measure and report:
- Time to first token (prompt evaluation time)
- Token generation speed (tokens per second)
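
For instance, a run that overrides the default prompt and averages over several measured iterations might look like this (a sketch; the `--prompt` spelling is assumed from the parameter table above):

```bash
# Benchmark with a custom prompt, 5 measured iterations, 1 warmup
lemonade -i MODEL_NAME \
    load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench \
    --prompt "Summarize the rules of chess in one paragraph." \
    --iterations 5 \
    --warmup-iterations 1
```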

### Example Commands

#### Windows Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1

# Run MMLU accuracy test
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
accuracy-mmlu \
--tests management \
--max-evals 2
```

#### Linux Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "./llama-cli" \
--model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1
```

## Integration with Other Tools

After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface, including:
- accuracy-mmlu
- llm-prompt
- accuracy-humaneval
- and more

The integration provides:
- Platform-independent path handling (works on both Windows and Linux)
- Proper error handling with detailed messages
- Performance metrics collection
- Configurable generation parameters (temperature, top_p, top_k)
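
As a sketch of that chaining, the same invocation that loads the model can hand it to `llm-prompt`; the `--prompt` flag spelling here is an assumption for illustration, not taken from this commit:

```bash
# Hypothetical: run a single prompt through the loaded llama.cpp model
# (the llm-prompt flag spelling is assumed, not verified against this commit)
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "./llama-cli" \
    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llm-prompt --prompt "What is the capital of France?"
```
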
3 changes: 2 additions & 1 deletion src/lemonade/cli.py
@@ -14,7 +14,7 @@

from lemonade.tools.huggingface_bench import HuggingfaceBench
from lemonade.tools.ort_genai.oga_bench import OgaBench

from lemonade.tools.llamacpp_bench import LlamaCppBench
from lemonade.tools.llamacpp import LoadLlamaCpp

import lemonade.cache as cache
@@ -30,6 +30,7 @@ def main():
tools = [
HuggingfaceLoad,
LoadLlamaCpp,
LlamaCppBench,
AccuracyMMLU,
AccuracyHumaneval,
AccuracyPerplexity,
