Merge branch 'main' into DecodingTrust
danielz02 committed Nov 21, 2023
2 parents fa1672f + 32342d9 commit 71979f3
Showing 171 changed files with 7,701 additions and 2,643 deletions.
15 changes: 11 additions & 4 deletions demo.py
@@ -17,23 +17,30 @@
 print(account.usages)

 # Make a request
-request = Request(model="ai21/j1-large", prompt="Life is like a box of", echo_prompt=True)
+request = Request(
+    model="ai21/j2-large", model_deployment="ai21/j2-large", prompt="Life is like a box of", echo_prompt=True
+)
 request_result: RequestResult = service.make_request(auth, request)
 print(request_result.completions[0].text)

 # Expect different responses for the same request but with different values for `random`.
 # Passing in the same value for `random` guarantees the same results.
-request = Request(prompt="Life is like a box of", random="1")
+request = Request(model="ai21/j2-large", model_deployment="ai21/j2-large", prompt="Life is like a box of", random="1")
 request_result = service.make_request(auth, request)
 print(request_result.completions[0].text)

 # How to get the embedding for some text
-request = Request(model="openai/text-similarity-ada-001", prompt="Life is like a box of", embedding=True)
+request = Request(
+    model="openai/text-similarity-ada-002",
+    model_deployment="openai/text-similarity-ada-002",
+    prompt="Life is like a box of",
+    embedding=True,
+)
 request_result = service.make_request(auth, request)
 print(request_result.embedding)

 # Tokenize
-request = TokenizationRequest(tokenizer="ai21/j1-jumbo", text="Tokenize me please.")
+request = TokenizationRequest(tokenizer="ai21/j2-jumbo", text="Tokenize me please.")
 tokenization_request_result: TokenizationRequestResult = service.tokenize(auth, request)
 print(f"Number of tokens: {len(tokenization_request_result.tokens)}")

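For readers trying the updated snippet outside the full `demo.py`, the hunk above assumes roughly the setup sketched below. The import paths, proxy URL, and API-key handling are assumptions based on recent HELM versions and are not part of this diff:

```python
# Rough sketch (assumed) of the context demo.py provides before the lines shown above.
from helm.common.authentication import Authentication
from helm.common.request import Request, RequestResult  # used by the request examples in the hunk
from helm.common.tokenization_request import TokenizationRequest, TokenizationRequestResult
from helm.proxy.services.remote_service import RemoteService

auth = Authentication(api_key="your-api-key")  # replace with a real key
service = RemoteService("https://crfm-models.stanford.edu")  # hosted HELM proxy (assumed URL)

# Query account usage, matching the first context line of the hunk.
account = service.get_account(auth)
print(account.usages)
```
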
84 changes: 84 additions & 0 deletions docs/get_helm_rank.md
@@ -0,0 +1,84 @@
# Get Your Model's Leaderboard Rank

This tutorial shows you how to locally add your model to the HELM leaderboard, in three steps:

## Download HELM leaderboard results

First, to compare your model against the latest models on the [HELM leaderboard](https://crfm.stanford.edu/helm/latest/?group=core_scenarios), choose the leaderboard version whose results you want to download:

```bash
export LEADERBOARD_VERSION=v0.3.0
```

Then download the zip file of all previous HELM results and expand it into HELM's results directory:

```bash
curl -O https://storage.googleapis.com/crfm-helm-public/benchmark_output/archives/$LEADERBOARD_VERSION/run_stats.zip &&\
mkdir -p benchmark_output/runs/$LEADERBOARD_VERSION && unzip run_stats.zip -d benchmark_output/runs/$LEADERBOARD_VERSION
```

Now that these files are in your results directory, all HELM models will appear in your UI alongside your own model.
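
As a quick sanity check, here is a minimal sketch (assuming the paths used in the commands above) that lists what was extracted:

```python
# Sketch (assumed paths): confirm the leaderboard results were extracted where
# helm-summarize and helm-server expect to find them.
import os
from pathlib import Path

version = os.environ.get("LEADERBOARD_VERSION", "v0.3.0")
runs_dir = Path("benchmark_output") / "runs" / version

entries = sorted(p.name for p in runs_dir.iterdir())
print(f"{len(entries)} entries under {runs_dir}, e.g. {entries[:3]}")
```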

## Run Efficient-HELM

According to [Efficient Benchmarking (of Language Models)](https://arxiv.org/pdf/2308.11696.pdf), a paper from IBM that systematically analysed benchmark design choices using HELM as a case study, one can run the HELM benchmark on a fraction of the examples and still get a reliable estimate of a full run (Perlitz et al., 2023).

Specifically, the authors computed the $95\%$ CI of the rank location relative to the true ranks as a function of the number of examples used per scenario, arriving at the following tradeoffs[^1]:

| Examples Per Scenario | CI $95\%$ of Rank Location | Compute saved |
| :-------------------: | :------------------------: | :-----------: |
| $10$ | $\pm5$ | $\times400$ |
| $20$ | $\pm4$ | $\times200$ |
| $50$ | $\pm3$ | $\times80$ |
| $200$ | $\pm2$ | $\times20$ |
| $1000$ | $\pm1$ | $\times4$ |
| All | $\pm1$ | $\times1$ |
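
To make the tradeoff concrete, here is a small, purely illustrative helper (not part of HELM or the paper) that picks the cheapest setting in the table for a given rank tolerance:

```python
# Illustrative only (not part of HELM): map an acceptable rank error to the
# cheapest row of the tradeoff table above.
TRADEOFFS = {5: 10, 4: 20, 3: 50, 2: 200, 1: 1000}  # ±rank (95% CI) -> examples per scenario


def examples_for_tolerance(max_rank_error: int) -> int:
    """Fewest examples per scenario whose 95% CI on rank location fits within the tolerance."""
    eligible = [n for ci, n in TRADEOFFS.items() if ci <= max_rank_error]
    if not eligible:
        raise ValueError("No row in the table is tighter than ±1; run on all examples instead.")
    return min(eligible)


print(examples_for_tolerance(3))  # 50 examples per scenario, roughly 80x less compute than a full run
```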


Choose your point on this tradeoff: how accurate does your rank need to be, and how long are you willing to wait? Once you have chosen, set the number of examples per scenario and the model you want to evaluate:

```bash
export EXAMPLES_PER_SCENARIO=10 && \
export MODEL_TO_RUN=huggingface/gpt2
```

Now run the following to download the matching run spec configuration file:

```bash
wget https://raw.githubusercontent.com/stanford-crfm/helm/main/src/helm/benchmark/presentation/run_specs_core_scenarios_$EXAMPLES_PER_SCENARIO.conf -O run_specs_$EXAMPLES_PER_SCENARIO.conf
```

and this one to run the benchmark:

```bash
helm-run \
--conf-paths run_specs_$EXAMPLES_PER_SCENARIO.conf \
--suite $LEADERBOARD_VERSION \
--max-eval-instances $EXAMPLES_PER_SCENARIO \
--models-to-run $MODEL_TO_RUN \
--cache-instances \
--num-train-trials 1 \
--skip-completed-runs
```

The first run will take some time, since all the data (regardless of the number of examples chosen) has to be downloaded and prepared.


## Summarize and serve your results

To view how your model fits in with the latest leaderboard, process and aggregate your results with:

```bash
helm-summarize --suite $LEADERBOARD_VERSION
```

And serve with:

```bash
helm-server
```

## References

Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M. and Choshen, L., 2023. Efficient Benchmarking (of Language Models). arXiv preprint arXiv:2308.11696.

[^1]: Note that the quantities in the table are the $95\%$ CI of the rank location and are thus very conservative estimates. In our experiments, we did not observe deviations above $\pm2$ for any of the options above.
1 change: 1 addition & 0 deletions docs/index.md
@@ -11,6 +11,7 @@ To run the code, refer to the User Guide's chapters:
 - [Installation](installation.md)
 - [Quick Start](quick_start.md)
 - [Tutorial](tutorial.md)
+- [Get Your Model's Leaderboard Rank](get_helm_rank.md)

 To add new models and scenarios, refer to the Developer Guide's chapters:

1 change: 1 addition & 0 deletions docs/quick_start.md
@@ -18,3 +18,4 @@ helm-server

 Then go to http://localhost:8000/ in your browser.

+**Next steps:** click [here](get_helm_rank.md) to find out how to run the full benchmark and get your model's leaderboard rank.
10 changes: 5 additions & 5 deletions docs/tutorial.md
@@ -2,20 +2,20 @@

 This tutorial will explain how to use the HELM command line tools to run benchmarks, aggregate statistics, and visualize results.

-We will run two runs using the `mmlu` scenario on the `huggingface/gpt-2` model. The `mmlu` scenario implements the **Massive Multitask Language Understanding (MMLU)** benchmark from [this paper](https://arxiv.org/pdf/2009.03300.pdf), and consists of a Question Answering (QA) task using a dataset with questions from 57 subjects such as elementary mathematics, US history, computer science, law, and more. Note that GPT-2 performs poorly on MMLU, so this is just a proof of concept. We will run two runs: the first using questions about anatomy, and the second using questions about philosophy.
+We will run two runs using the `mmlu` scenario on the `openai/gpt2` model. The `mmlu` scenario implements the **Massive Multitask Language Understanding (MMLU)** benchmark from [this paper](https://arxiv.org/pdf/2009.03300.pdf), and consists of a Question Answering (QA) task using a dataset with questions from 57 subjects such as elementary mathematics, US history, computer science, law, and more. Note that GPT-2 performs poorly on MMLU, so this is just a proof of concept. We will run two runs: the first using questions about anatomy, and the second using questions about philosophy.

 ## Using `helm-run`

 `helm-run` is a command line tool for running benchmarks.

-To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=huggingface/gpt-2` (for anatomy) and `mmlu:subject=philosophy,model=huggingface/gpt-2` (for philosophy).
+To run this benchmark using the HELM command-line tools, we need to specify **run spec descriptions** that describe the desired runs. For this example, the run spec descriptions are `mmlu:subject=anatomy,model=openai/gpt2` (for anatomy) and `mmlu:subject=philosophy,model=openai/gpt2` (for philosophy).

 Next, we need to create a **run spec configuration file** containing these run spec descriptions. A run spec configuration file is a text file containing `RunEntries` serialized to JSON, where each entry in `RunEntries` contains a run spec description. The `description` field of each entry should be a **run spec description**. Create a text file named `run_specs.conf` with the following contents:

 ```
 entries: [
-    {description: "mmlu:subject=anatomy,model=huggingface/gpt2", priority: 1},
-    {description: "mmlu:subject=philosophy,model=huggingface/gpt2", priority: 1},
+    {description: "mmlu:subject=anatomy,model=openai/gpt2", priority: 1},
+    {description: "mmlu:subject=philosophy,model=openai/gpt2", priority: 1},
 ]
 ```

@@ -35,7 +35,7 @@ The meaning of the additional arguments are as follows:
 - The environment directory is `prod_env/` by default and can be set using `--local-path`. Credentials for making API calls should be added to a `credentials.conf` file in this directory.
 - The output directory is `benchmark_output/` by default and can be set using `--output-path`.

-After running this command, navigate to the `benchmark_output/runs/v1/` directory. This should contain two sub-directories named `mmlu:subject=anatomy,model=huggingface_gpt-2` and `mmlu:subject=philosophy,model=huggingface_gpt-2`. Note that the names of these sub-directories are based on the run spec descriptions we used earlier, but with `/` replaced with `_`.
+After running this command, navigate to the `benchmark_output/runs/v1/` directory. This should contain two sub-directories named `mmlu:subject=anatomy,model=openai_gpt2` and `mmlu:subject=philosophy,model=openai_gpt2`. Note that the names of these sub-directories are based on the run spec descriptions we used earlier, but with `/` replaced with `_`.

 Each output sub-directory will contain several JSON files that were generated during the corresponding run:
