Merge branch 'main' into DecodingTrust
danielz02 committed Nov 21, 2023
2 parents fa1672f + 32342d9 commit 71979f3
Showing 171 changed files with 7,701 additions and 2,643 deletions.
15 changes: 11 additions & 4 deletions demo.py
@@ -17,23 +17,30 @@
 print(account.usages)

 # Make a request
-request = Request(model="ai21/j1-large", prompt="Life is like a box of", echo_prompt=True)
+request = Request(
+    model="ai21/j2-large", model_deployment="ai21/j2-large", prompt="Life is like a box of", echo_prompt=True
+)
 request_result: RequestResult = service.make_request(auth, request)
 print(request_result.completions[0].text)

 # Expect different responses for the same request but with different values for `random`.
 # Passing in the same value for `random` guarantees the same results.
-request = Request(prompt="Life is like a box of", random="1")
+request = Request(model="ai21/j2-large", model_deployment="ai21/j2-large", prompt="Life is like a box of", random="1")
 request_result = service.make_request(auth, request)
 print(request_result.completions[0].text)

 # How to get the embedding for some text
-request = Request(model="openai/text-similarity-ada-001", prompt="Life is like a box of", embedding=True)
+request = Request(
+    model="openai/text-similarity-ada-002",
+    model_deployment="openai/text-similarity-ada-002",
+    prompt="Life is like a box of",
+    embedding=True,
+)
 request_result = service.make_request(auth, request)
 print(request_result.embedding)

 # Tokenize
-request = TokenizationRequest(tokenizer="ai21/j1-jumbo", text="Tokenize me please.")
+request = TokenizationRequest(tokenizer="ai21/j2-jumbo", text="Tokenize me please.")
 tokenization_request_result: TokenizationRequestResult = service.tokenize(auth, request)
 print(f"Number of tokens: {len(tokenization_request_result.tokens)}")

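For readers trying the updated snippet outside the full `demo.py`, the hunk above assumes roughly the setup sketched below. The import paths, proxy URL, and API-key handling are assumptions based on recent HELM versions and are not part of this diff:

```python
# Rough sketch (assumed) of the context demo.py provides before the lines shown above.
from helm.common.authentication import Authentication
from helm.common.request import Request, RequestResult  # used by the request examples in the hunk
from helm.common.tokenization_request import TokenizationRequest, TokenizationRequestResult
from helm.proxy.services.remote_service import RemoteService

auth = Authentication(api_key="your-api-key")  # replace with a real key
service = RemoteService("https://crfm-models.stanford.edu")  # hosted HELM proxy (assumed URL)

# Query account usage, matching the first context line of the hunk.
account = service.get_account(auth)
print(account.usages)
```
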
84 changes: 84 additions & 0 deletions docs/get_helm_rank.md
@@ -0,0 +1,84 @@
# Get Your Model's Leaderboard Rank

This tutorial shows you how to locally add your model to the HELM leaderboard, in three steps:

## Download HELM leaderboard results

First, to compare your model against the latest models on the [HELM leaderboard](https://crfm.stanford.edu/helm/latest/?group=core_scenarios), choose the leaderboard version whose results you want to download:

```bash
export LEADERBOARD_VERSION=v0.3.0
```

Then download the zip file of all previous HELM results and expand it into HELM's results directory:

```bash
curl -O https://storage.googleapis.com/crfm-helm-public/benchmark_output/archives/$LEADERBOARD_VERSION/run_stats.zip &&\
mkdir -p benchmark_output/runs/$LEADERBOARD_VERSION && unzip run_stats.zip -d benchmark_output/runs/$LEADERBOARD_VERSION
```

Now that these files are in your results directory, all HELM models will appear in your UI alongside your own model.
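
As a quick sanity check, here is a minimal sketch (assuming the paths used in the commands above) that lists what was extracted:

```python
# Sketch (assumed paths): confirm the leaderboard results were extracted where
# helm-summarize and helm-server expect to find them.
import os
from pathlib import Path

version = os.environ.get("LEADERBOARD_VERSION", "v0.3.0")
runs_dir = Path("benchmark_output") / "runs" / version

entries = sorted(p.name for p in runs_dir.iterdir())
print(f"{len(entries)} entries under {runs_dir}, e.g. {entries[:3]}")
```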

## Run Efficient-HELM

According to [Efficient Benchmarking (of Language Models)](https://arxiv.org/pdf/2308.11696.pdf), a paper from IBM that systematically analysed benchmark design choices using HELM as a case study, one can run the HELM benchmark on a fraction of the examples and still get a reliable estimate of a full run (Perlitz et al., 2023).

Specifically, the authors computed the $95\%$ CI of the rank location relative to the true ranks as a function of the number of examples used per scenario, arriving at the following tradeoffs[^1]:

| Examples Per Scenario | CI $95\%$ of Rank Location | Compute saved |
| :-------------------: | :------------------------: | :-----------: |
| $10$ | $\pm5$ | $\times400$ |
| $20$ | $\pm4$ | $\times200$ |
| $50$ | $\pm3$ | $\times80$ |
| $200$ | $\pm2$ | $\times20$ |
| $1000$ | $\pm1$ | $\times4$ |
| All | $\pm1$ | $\times1$ |
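
To make the tradeoff concrete, here is a small, purely illustrative helper (not part of HELM or the paper) that picks the cheapest setting in the table for a given rank tolerance:

```python
# Illustrative only (not part of HELM): map an acceptable rank error to the
# cheapest row of the tradeoff table above.
TRADEOFFS = {5: 10, 4: 20, 3: 50, 2: 200, 1: 1000}  # ±rank (95% CI) -> examples per scenario


def examples_for_tolerance(max_rank_error: int) -> int:
    """Fewest examples per scenario whose 95% CI on rank location fits within the tolerance."""
    eligible = [n for ci, n in TRADEOFFS.items() if ci <= max_rank_error]
    if not eligible:
        raise ValueError("No row in the table is tighter than ±1; run on all examples instead.")
    return min(eligible)


print(examples_for_tolerance(3))  # 50 examples per scenario, roughly 80x less compute than a full run
```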


Choose your point on this tradeoff: how accurate does your rank need to be, and how long are you willing to wait? Once you have chosen, set the number of examples per scenario and the model you want to evaluate:

```bash
export EXAMPLES_PER_SCENARIO=10 && \
export MODEL_TO_RUN=huggingface/gpt2
```

Now run the following to download the matching run spec configuration file:

```bash
wget https://raw.githubusercontent.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_specs_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_specs_$EXAMPLES_PER_SCENARIO.conf
```

and this one to run the benchmark:

```bash
helm-run \
--conf-paths run_specs_$EXAMPLES_PER_SCENARIO.conf \
--suite $LEADERBOARD_VERSION \
--max-eval-instances $EXAMPLES_PER_SCENARIO \
--models-to-run $MODEL_TO_RUN \
--cache-instances \
--num-train-trials 1 \
--skip-completed-runs
```

The first run will take some time, since all the data (regardless of the number of examples chosen) has to be downloaded and prepared.


## Summarize and serve your results

To view how your model fits in with the latest leaderboard, process and aggregate your results with:

```bash
helm-summarize --suite $LEADERBOARD_VERSION
```

And serve with:

```bash
helm-server
```

## References

Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M. and Choshen, L., 2023. Efficient Benchmarking (of Language Models). arXiv preprint arXiv:2308.11696.

[^1]: Note that the quantities in the table are the $95\%$ CI of the rank location and are thus very conservative estimates. In our experiments, we did not observe deviations above $\pm2$ for any of the options above.
1 change: 1 addition & 0 deletions docs/index.md
@@ -11,6 +11,7 @@ To run the code, refer to the User Guide's chapters:
 - [Installation](installation.md)
 - [Quick Start](quick_start.md)
 - [Tutorial](tutorial.md)
+- [Get Your Model's Leaderboard Rank](get_helm_rank.md)

 To add new models and scenarios, refer to the Developer Guide's chapters:

1 change: 1 addition & 0 deletions docs/quick_start.md
@@ -18,3 +18,4 @@ helm-server

 Then go to http://localhost:8000/ in your browser.

+**Next steps:** click [here](get_helm_rank.md) to find out how to run the full benchmark and get your model's leaderboard rank.
10 changes: 5 additions & 5 deletions docs/tutorial.md
@@ -2,20 +2,20 @@

 This tutorial will explain how to use the HELM command line tools to run benchmarks, aggregate statistics, and visualize results.

-We will run two runs using the `mmlu` scenario on the `huggingface/gpt-2` model. The `mmlu` scenario implements the **Massive Multitask Language Understanding (MMLU)** benchmark from [this paper](https://arxiv.org/pdf/2009.03300.pdf), and consists of a Question Answering (QA) task using a dataset with questions from 57 subjects such as elementary mathematics, US history, computer science, law, and more. Note that GPT-2 performs poorly on MMLU, so this is just a proof of concept. We will run two runs: the first using questions about anatomy, and the second using questions about philosophy.
+We will run two runs using the `mmlu` scenario on the `openai/gpt2` model. The `mmlu` scenario implements the **Massive Multitask Language Understanding (MMLU)** benchmark from [this paper](https://arxiv.org/pdf/2009.03300.pdf), and consists of a Question Answering (QA) task using a dataset with questions from 57 subjects such as elementary mathematics, US history, computer science, law, and more. Note that GPT-2 performs poorly on MMLU, so this is just a proof of concept. We will run two runs: the first using questions about anatomy, and the second using questions about philosophy.

 ## Using `helm-run`

 `helm-run` is a command line tool for running benchmarks.

-To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=huggingface/gpt-2` (for anatomy) and `mmlu:subject=philosophy,model=huggingface/gpt-2` (for philosophy).
+To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=openai/gpt2` (for anatomy) and `mmlu:subject=philosophy,model=openai/gpt2` (for philosophy).

 Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_specs.conf` with the following contents:

 ```
 entries: [
-    {description: "mmlu:subject=anatomy,model=huggingface/gpt2", priority: 1},
-    {description: "mmlu:subject=philosophy,model=huggingface/gpt2", priority: 1},
+    {description: "mmlu:subject=anatomy,model=openai/gpt2", priority: 1},
+    {description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1},
 ]
 ```

@@ -35,7 +35,7 @@ The meaning of the additional arguments are as follows:
 - The environment directory is `prod_env/` by default and can be set using `--local-path`. Credentials for making API calls should be added to a `credentials.conf` file in this directory.
 - The output directory is `benchmark_output/` by default and can be set using `--output-path`.

-After running this command, navigate to the `benchmark_output/runs/v1/` directory. This should contain two sub-directories named `mmlu:subject=anatomy,model=huggingface_gpt-2` and `mmlu:subject=philosophy,model=huggingface_gpt-2`. Note that the names of these sub-directories are based on the run spec descriptions we used earlier, but with `/` replaced with `_`.
+After running this command, navigate to the `benchmark_output/runs/v1/` directory. This should contain two sub-directories named `mmlu:subject=anatomy,model=openai_gpt2` and `mmlu:subject=philosophy,model=openai_gpt2`. Note that the names of these sub-directories are based on the run spec descriptions we used earlier, but with `/` replaced with `_`.

 Each output sub-directory will contain several JSON files that were generated during the corresponding run:
