Machine Translation Multilingual Evaluation #1

Open

wants to merge 89 commits into base: master

Changes from all commits (89 commits)
8ac9926
Replace the `fewshot_description` API with a `description_dict` based…
jon-tow Oct 30, 2021
57e064d
Merge branch 'master' of https://github.com/EleutherAI/lm-evaluation-…
jon-tow Oct 31, 2021
792a765
Merge remote-tracking branch 'origin/master' into evaluator-descripti…
leogao2 Nov 1, 2021
0b7a792
Merge remote-tracking branch 'origin/master' into evaluator-descripti…
leogao2 Nov 1, 2021
564e061
Merge branch 'master' of https://github.com/EleutherAI/lm-evaluation-…
jon-tow Dec 6, 2021
ff314d6
Merge branch 'master' of https://github.com/EleutherAI/lm-evaluation-…
jon-tow Dec 15, 2021
1d04c42
Merge
jon-tow Dec 15, 2021
7a35797
Merge branch 'evaluator-description-option' of https://github.com/jon…
jon-tow Dec 15, 2021
ee53be2
Add `provide_description` arg for backward compat
jon-tow Dec 15, 2021
d7a8ab2
Add basic `description_dict`
jon-tow Dec 16, 2021
3fdff22
Remove `print` from test
jon-tow Dec 16, 2021
09cd76c
Fix assertion error string
jon-tow Dec 16, 2021
d131995
Add newline to end of file
jon-tow Dec 16, 2021
10dd7d3
Make `evaluate` and `simple_evaluate` description args consistent
jon-tow Dec 16, 2021
e3ddcfc
Remove needless call
jon-tow Dec 16, 2021
09e1de9
Fix assertion error message
jon-tow Dec 16, 2021
3f06b60
Remove unused import
jon-tow Dec 16, 2021
e54380d
Fix task example link
jon-tow Dec 16, 2021
744482b
Fix task example link
jon-tow Dec 16, 2021
1bc6cdb
Add type info to doc-string
jon-tow Dec 16, 2021
acf76b5
Add `description_dict` docs and update `task-guide`
jon-tow Dec 17, 2021
70f9273
Fix doc reference
jon-tow Dec 17, 2021
d34ae3c
Add `description_dict` to results config
jon-tow Dec 21, 2021
8ebe36b
Add positional arg deprecation decorator
jon-tow Dec 21, 2021
aea963a
Format for consistency
jon-tow Dec 21, 2021
5855f48
Allow users to specify en headqa or es
thomasw21 Dec 23, 2021
cdab2c0
headqa: maintain backwards compatibility
leogao2 Dec 24, 2021
22c4124
Add new testdata
leogao2 Dec 24, 2021
7b2b2a2
Make simple_evaluate take LM and Task objects directly too
leogao2 Dec 24, 2021
d86aabc
more changes
leogao2 Dec 24, 2021
a34bbe6
Remove more `provide_description` uses
jon-tow Dec 24, 2021
57d0718
Remove all `provide_description` argument uses
jon-tow Dec 24, 2021
0e232f7
Update new `task` arg and task dict getter
jon-tow Dec 24, 2021
377a1f4
Merge remote-tracking branch 'origin/master' into thomas/fix_head_qa
thomasw21 Dec 24, 2021
666b615
Fix README
thomasw21 Dec 24, 2021
e34c6bd
Fix README
thomasw21 Dec 24, 2021
3836051
Fix bits_per_byte metric in PerplexityTask
igor0 Dec 25, 2021
df1fc6c
Fix multirc
thomasw21 Dec 26, 2021
23a4206
Add capital letters
thomasw21 Dec 26, 2021
b0a1231
add asdiv task
xagi-dev Dec 26, 2021
bce9f28
remove apps
xagi-dev Dec 26, 2021
440216d
Questions in BoolQ don't have interrogation punctuation at the end
thomasw21 Dec 27, 2021
50ac7df
Merge pull request #240 from bigscience-workshop/thomas/fix_head_qa
leogao2 Dec 29, 2021
97ca18e
Update README.md
leogao2 Dec 29, 2021
0d9d47d
Update README.md
leogao2 Dec 29, 2021
5a53c36
pile: Switch download over to backup host temporarily
leogao2 Dec 29, 2021
f42a8d6
Add testdata for boolq v1
thomasw21 Dec 29, 2021
0bde758
Revert "Add capital letters"
thomasw21 Dec 29, 2021
d7e3248
Pretty sure questions need to be paragraph dependent also
thomasw21 Dec 29, 2021
73d0ae5
Add testdata for multirc v1
thomasw21 Dec 29, 2021
0463573
remove unrequired files&add pin commit hash
xagi-dev Dec 29, 2021
6653cc5
Bump the version number for all tasks based on PerplexityTask
igor0 Dec 30, 2021
0fe2e80
Merge pull request #245 from bigscience-workshop/thomas/improve_boolq
leogao2 Dec 31, 2021
72d7cc0
remove _strip_bracket function
xagi-dev Dec 31, 2021
83e1a11
removed strip_bracket function
xagi-dev Dec 31, 2021
33315a1
pile/wikitext: add testdata
leogao2 Jan 1, 2022
4b3dee6
asdiv: space convention
leogao2 Jan 2, 2022
cb3babd
Improve pile test efficiency
leogao2 Jan 3, 2022
9d87d47
Delete test_cache.db
leogao2 Jan 3, 2022
a67c17e
Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness
leogao2 Jan 3, 2022
ff58b38
Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness …
leogao2 Jan 3, 2022
8dbd24f
Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness …
leogao2 Jan 3, 2022
70a9c47
Merge pull request #242 from igor0/bits_per_byte
leogao2 Jan 4, 2022
8728710
Merge pull request #244 from rokosbasilisk/asdiv
leogao2 Jan 4, 2022
d09561f
Update description_guide.md
leogao2 Jan 4, 2022
d2636b4
Enforce `rnd` args with assertions
jon-tow Jan 4, 2022
c3bec45
Merge branch 'evaluator-description-option' of https://github.com/jon…
jon-tow Jan 4, 2022
76dc609
Best-download have backward compatibility issue
thomasw21 Jan 8, 2022
b2f6bce
Fix fewshot_context method handling
leogao2 Jan 8, 2022
5792030
Merge branch 'master' of github.com:EleutherAI/lm-evaluation-harness …
leogao2 Jan 8, 2022
c65412e
Actually it shouldn't be hard to fix it to be compatible with future …
thomasw21 Jan 8, 2022
02a4def
Update blimp fewshot_context
leogao2 Jan 8, 2022
170ae09
Merge pull request #226 from jon-tow/evaluator-description-option
leogao2 Jan 8, 2022
cc23812
Merge pull request #243 from bigscience-workshop/thomas/fix_multirc
leogao2 Jan 8, 2022
78824d7
Merge branch 'master' into thomas/fix_best_download_version
thomasw21 Jan 8, 2022
ed6931e
Merge pull request #250 from bigscience-workshop/thomas/fix_best_down…
leogao2 Jan 8, 2022
2d9fc25
Missed asdiv
thomasw21 Jan 8, 2022
ea3fd79
Fix CB
thomasw21 Jan 11, 2022
b421f05
Bump version
thomasw21 Jan 11, 2022
be40299
Merge pull request #252 from bigscience-workshop/thomas/fix_best_down…
leogao2 Jan 11, 2022
caff4f1
Conform `asidv` to the new description api
jon-tow Jan 13, 2022
4849702
Merge pull request #256 from jon-tow/update-asdiv
leogao2 Jan 15, 2022
c3d941b
Merge pull request #254 from bigscience-workshop/thomas/fix_cb
leogao2 Jan 15, 2022
1c52e91
cb: add testdata
leogao2 Jan 15, 2022
03c15e0
Update `headqa` deprecation warning to display on init only
jon-tow Jan 23, 2022
26f0233
Merge pull request #259 from jon-tow/headqa-ux
leogao2 Jan 30, 2022
1dc6eb0
adding support of XGLM model
hadyelsahar Feb 1, 2022
c3b7267
Fix new lines tokenizer issue in XGLM
hadyelsahar Feb 10, 2022
fbdc61c
integration of bigscience model
hadyelsahar Feb 17, 2022
8 changes: 4 additions & 4 deletions README.md
@@ -21,7 +21,7 @@ pip install lm-eval

## Basic Usage

To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command.
To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. **When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info.

```bash
python main.py \
@@ -55,7 +55,7 @@ To evaluate mesh-transformer-jax models that are not available on HF, please inv

## Implementing new tasks

To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/task-guide.md).
To implement a new task in eval harness, see [this guide](./docs/task_guide.md).

## Cite as

@@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1|
|race |✓ |✓ |✓ | 1045|acc |
|headqa |✓ |✓ |✓ | 2742|acc, acc_norm |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm |
|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm |
|webqs |✓ | |✓ | 2032|acc |
|wsc273 | | |✓ | 273|acc |
|winogrande |✓ |✓ | | 1267|acc |
@@ -363,7 +364,6 @@ To inspect what the LM inputs look like, you can run the following command:
```bash
python write_out.py \
--tasks all_tasks \
--provide_description \
--num_fewshot 5 \
--num_examples 10 \
--output_base_path /path/to/output/folder
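To illustrate the task-versioning note added to the README above, here is a minimal sketch of pulling `results["versions"]` out of a run for reporting. It assumes the `simple_evaluate` signature introduced in this PR; the model and task names are illustrative placeholders.

```python
# Sketch only: report task versions alongside scores, as the README asks.
# Assumes the new `simple_evaluate(model, tasks=...)` signature from this PR;
# "gpt2" and the task list are placeholders, not part of the diff.
import json

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["lambada", "hellaswag"],
    num_fewshot=0,
)

# `results["results"]` holds the metric values; `results["versions"]` holds
# the per-task version numbers that should be quoted with any reported score.
print(json.dumps(results["results"], indent=2))
print(json.dumps(results["versions"], indent=2))
```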
49 changes: 49 additions & 0 deletions docs/description_guide.md
@@ -0,0 +1,49 @@
# Description Guide

![fewshot-example](./img/fewshot_example_gpt3.png)
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))

Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:

- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.

```python
description_dict = {
"task_name_1": "description",
"task_name_2": "description",
...
}
```

Note that a task's description will be separated from its following few-shot examples and prompt by a new line as such:

```python
"""
<description>

<examples>

<prompt>
"""
```

## Descriptions in File

One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. for some file at `/your/path/descriptions.json` you may have:

```json
{
"cycle_letters": "Please unscramble the letters into a word, and write that word:",
"copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```

which can then be supplied to the CLI as:

```bash
python main.py \
--tasks cycle_letters,copa \
--description_dict_path /your/path/descriptions.json \
...
```
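For completeness, a hedged sketch of the programmatic path mentioned at the top of this guide: passing the same mapping as the `description_dict` argument instead of a JSON file. The task names and descriptions are the guide's own examples; the model choice is an assumption.

```python
# Sketch: supply task descriptions programmatically rather than via
# --description_dict_path. The mapping mirrors the JSON example above;
# "gpt2" is an assumed model name, not something this guide prescribes.
from lm_eval import evaluator

description_dict = {
    "cycle_letters": "Please unscramble the letters into a word, and write that word:",
    "copa": (
        "Given a premise and one alternative with a causal relation to the "
        "premise and another without, choose the more plausible alternative"
    ),
}

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["cycle_letters", "copa"],
    num_fewshot=0,
    description_dict=description_dict,
)
```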
Binary file added docs/img/fewshot_example_gpt3.png
20 changes: 6 additions & 14 deletions task-guide.md → docs/task_guide.md
@@ -87,8 +87,7 @@ There are 2 standard approaches we follow for downloading data:
```
These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.

Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.:
`{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
return #...
@@ -125,17 +124,9 @@ You can now skip ahead to <a href="#Registering-Your-Task">registering your task

<br>

In the case your task is _not_ multiple-choice, override the following methods for your task class:

In the case your task is not multiple-choice, override the following methods for your task class:

Put the natural language task description as a single line (no `\n`s) string here. E.g. `"Translate English to French:"`

```python
def fewshot_description(self):
return ""
```

Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form) . You should concatenate its members into a nicely formatted prompt.
Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.

```python
def doc_to_text(self, doc):
@@ -161,11 +152,12 @@ After registering your task, you can now check on your data downloading and veri

```bash
python -m scripts.write_out \
--task <your-task> \
--output_base_path <path> \
--tasks <your-task> \
--sets <train | val | test> \
--num_fewshot K \
--num_examples N
--num_examples N \
--description_dict_path <path>
```

Open the file specified at the `--output_base_path <path>` and ensure it passes
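To make the non-multiple-choice instructions above concrete, here is an illustrative fragment of a `doc_to_text`/`doc_to_target` pair for the `{"question": ..., "answer": ...}` document format used earlier in the guide. The class name, prompt format, and omission of the other required `Task` methods are editorial assumptions, not something the guide prescribes.

```python
# Illustrative fragment only: a doc_to_text/doc_to_target pair for docs like
# {"question": "What is the capital of France?", "answer": "Paris"}.
# The remaining required Task methods (download, has_*_docs, *_docs,
# construct_requests, process_results, ...) are omitted for brevity.
from lm_eval.base import Task


class ExampleQATask(Task):
    VERSION = 0

    def doc_to_text(self, doc):
        # The query prompt only -- the answer is deliberately left out.
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # The target continuation; the leading space follows the harness
        # convention so that doc_to_text(doc) + doc_to_target(doc) reads naturally.
        return " " + doc["answer"]
```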
61 changes: 50 additions & 11 deletions lm_eval/base.py
@@ -1,6 +1,7 @@
import abc
from typing import Iterable
import numpy as np
import random
import re
import os
import json
@@ -10,7 +11,7 @@
import torch
import torch.nn.functional as F

from lm_eval.metrics import mean, weighted_perplexity, weighted_mean
from lm_eval.metrics import mean, weighted_perplexity, weighted_mean, bits_per_byte
from lm_eval import utils
from abc import abstractmethod

@@ -450,11 +451,43 @@ def higher_is_better(self):
pass

def fewshot_description(self):
import warnings
warnings.warn(
"`fewshot_description` will be removed in futures versions. Pass "
"any custom descriptions to the `evaluate` function instead.",
DeprecationWarning)
return ""

def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
raw_description = self.fewshot_description()
description = (raw_description + "\n===\n\n") if provide_description and raw_description else ""
@utils.positional_deprecated
def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
""" Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.

:param doc: str
The document as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:param provide_description: bool
Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method
:param rnd: random.Random
The pseudo-random number generator used to randomly sample examples.
WARNING: This is currently a required arg although it's optionalized with a default `None`.
:param description: str
The task's description that will be prepended to the fewshot examples.
:returns: str
The fewshot context.
"""
assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the "
"`description` arg."
)
if provide_description is not None:
# nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

description = description + "\n\n" if description else ""

if num_fewshot == 0:
labeled_examples = ""
@@ -523,16 +556,22 @@ class PerplexityTask(Task, abc.ABC):
def has_training_docs(self):
return False

def fewshot_description(self):
return ""

def fewshot_examples(self, k, rnd):
assert k == 0
return []

def fewshot_context(self, doc, num_fewshot, provide_description, rnd):
def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
assert num_fewshot == 0
assert not provide_description
assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
assert not provide_description, (
"The `provide_description` arg will be removed in future versions. To prepend "
"a custom description to the context, supply the corresponding string via the "
"`description` arg."
)
if provide_description is not None:
# nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

return ""

def higher_is_better(self):
@@ -560,14 +599,14 @@ def process_results(self, doc, results):
return {
"word_perplexity": (loglikelihood, words),
"byte_perplexity": (loglikelihood, bytes_),
"bits_per_byte": (-loglikelihood, self.count_bytes(doc))
"bits_per_byte": (loglikelihood, bytes_),
}

def aggregation(self):
return {
"word_perplexity": weighted_perplexity,
"byte_perplexity": weighted_perplexity,
"bits_per_byte": weighted_mean
"bits_per_byte": bits_per_byte,
}

@classmethod
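A hedged usage sketch of the reworked `fewshot_context` API in this file: `rnd` is now effectively required and the description is passed in explicitly rather than pulled from `fewshot_description`. The task choice, seed, and description string are placeholders. The companion `bits_per_byte` change above presumably pairs with a new `bits_per_byte` aggregation in `lm_eval.metrics` that converts the summed natural-log likelihood per byte into bits (i.e. divides by ln 2) instead of reporting a raw weighted mean.

```python
# Sketch of the new calling convention: the caller supplies its own
# random.Random and passes any description explicitly. Task name, seed,
# and description text are placeholders.
import random

from lm_eval import tasks

task = tasks.get_task_dict(["copa"])["copa"]
doc = next(iter(task.validation_docs()))

rnd = random.Random()
rnd.seed(42)

ctx = task.fewshot_context(
    doc=doc,
    num_fewshot=2,
    rnd=rnd,  # required: omitting it now trips the assertion added above
    description="Choose the more plausible alternative.",
)
print(ctx)
```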
67 changes: 47 additions & 20 deletions lm_eval/evaluator.py
@@ -6,19 +6,23 @@
import lm_eval.tasks
import lm_eval.base
import numpy as np
from lm_eval.utils import positional_deprecated


def simple_evaluate(model, model_args, task_names,
@positional_deprecated
def simple_evaluate(model, model_args=None, tasks=[],
num_fewshot=0, batch_size=None, device=None,
no_cache=False, limit=None, bootstrap_iters=100000):
no_cache=False, limit=None, bootstrap_iters=100000,
description_dict=None):
"""Instantiate and evaluate a model on a list of tasks.

:param model: str
Name of model, see lm_eval.models.get_model
:param model_args: str
String arguments for each model class, see LM.create_from_arg_string
:param task_names: list[str]
List of task names
:param model: Union[str, LM]
Name of model or LM object, see lm_eval.models.get_model
:param model_args: Optional[str]
String arguments for each model class, see LM.create_from_arg_string.
Ignored if `model` argument is a LM object.
:param tasks: list[Union[str, Task]]
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
@@ -31,23 +35,39 @@ def simple_evaluate(model, model_args, task_names,
Limit the number of examples per task (only use this for testing)
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description`
:return
Dictionary of results
"""
random.seed(1234)
np.random.seed(1234)

lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, {
'batch_size': batch_size, 'device': device
})
assert tasks != [], "No tasks specified"

if isinstance(model, str):
if model_args is None: model_args = ""
lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, {
'batch_size': batch_size, 'device': device
})
else:
assert isinstance(model, lm_eval.base.LM)
lm = model

if not no_cache:
lm = lm_eval.base.CachingLM(
lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_').replace('/', '-') + '.db'
)

task_dict = lm_eval.tasks.get_task_dict(task_names)
results = evaluate(lm, task_dict, False, num_fewshot, limit)
task_dict = lm_eval.tasks.get_task_dict(tasks)

results = evaluate(
lm=lm,
task_dict=task_dict,
num_fewshot=num_fewshot,
limit=limit,
description_dict=description_dict
)

# add info about the model and few shot config
results["config"] = {
@@ -58,19 +78,21 @@ def simple_evaluate(model, model_args, task_names,
"device": device,
"no_cache": no_cache,
"limit": limit,
"bootstrap_iters": bootstrap_iters
"bootstrap_iters": bootstrap_iters,
"description_dict": description_dict
}

return results


def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_iters=100000):
@positional_deprecated
def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None):
"""Instantiate and evaluate a model on a list of tasks.

:param lm: obj
Language Model
:param task_dict: dict[str, Task]
Dictionary of tasks
Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param provide_description: bool
Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method
:param num_fewshot: int
@@ -79,13 +101,18 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
Limit the number of examples per task (only use this for testing)
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description`
:return
Dictionary of results
"""
# TODO: completely refactor this entire function to not be a huge mess, ideally breaking it down into smaller pieces

# TODO: todo: implement proper description-providing system
assert not provide_description # not implemented.
if provide_description is not None:
# nudge people to not specify it at all
print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

task_dict_items = [
(name, task)
@@ -125,16 +152,16 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
rnd.seed(42)
rnd.shuffle(task_docs)

description = description_dict[task_name] if description_dict and task_name in description_dict else ""

for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
docs[(task_name, doc_id)] = doc

ctx = task.fewshot_context(
doc=doc,
provide_description=provide_description,
num_fewshot=num_fewshot,
rnd=rnd
rnd=rnd,
description=description
)

reqs = task.construct_requests(doc, ctx)
if not isinstance(reqs, (list, tuple)):
reqs = [reqs]
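Finally, a hedged sketch of the other behavior the new `simple_evaluate` supports: passing an already-constructed LM object and Task objects instead of name strings. The model construction mirrors the `create_from_arg_string` call in the diff above; everything else (model name, task choice) is an assumption.

```python
# Sketch: simple_evaluate now also accepts LM and Task objects directly.
# The model is built the same way the diff above builds it from a string;
# "gpt2" and "copa" are placeholders.
import lm_eval.models
import lm_eval.tasks
from lm_eval import evaluator

lm = lm_eval.models.get_model("gpt2").create_from_arg_string(
    "", {"batch_size": None, "device": None}
)
task = lm_eval.tasks.get_task_dict(["copa"])["copa"]

results = evaluator.simple_evaluate(
    model=lm,       # an LM instance, so model_args is ignored
    tasks=[task],   # Task objects are keyed by type(task).__name__
    num_fewshot=0,
    no_cache=True,  # the caching path builds its filename from string model names
)
```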