Refactor of perplexity computation #1197

Merged
merged 98 commits into from
Nov 10, 2023

anmarques
Member

@anmarques anmarques commented Aug 23, 2023

This refactor simplifies the perplexity computation and brings support for different datasets into the same codebase. Highlights of the changes:

  1. The perplexity class is now agnostic of the pipeline. It only sees predictions and targets. It does have an argument called "accumulate" that indicates whether to compute ppl for each sample separately or accumulate across samples.
  2. Handle masking of tokens and logits in the perplexity_eval function. To avoid complications regarding the attention mask not being uniform within a batch, process each sample separately even if the pipeline is executed in batched mode.
  3. Add logic to split data for wikitext such that each sample has the same number of tokens.
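
To illustrate points 1 and 2, here is a minimal sketch of a pipeline-agnostic perplexity accumulator. It is not the code in this PR: the class name `PerplexitySketch`, the method names `add_sample` and `compute`, and the `mask` parameter are hypothetical, and averaging per-sample perplexities when `accumulate=False` is an assumption. It uses only the standard library, so logits are plain lists of floats.

```python
import math
from typing import List, Optional, Sequence


def _log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


class PerplexitySketch:
    """Hypothetical stand-in: it only ever sees predictions and targets."""

    def __init__(self, accumulate: bool = False):
        self.accumulate = accumulate
        self._total_nll = 0.0  # running negative log likelihood (accumulate=True)
        self._total_tokens = 0  # running token count (accumulate=True)
        self._sample_ppls: List[float] = []  # one value per sample (accumulate=False)

    def add_sample(
        self,
        logits: Sequence[Sequence[float]],
        targets: Sequence[int],
        mask: Optional[Sequence[int]] = None,
    ) -> None:
        """Process one sample; positions where mask is 0 (padding) are skipped."""
        if mask is None:
            mask = [1] * len(targets)
        nll, count = 0.0, 0
        for row, target, keep in zip(logits, targets, mask):
            if not keep:
                continue
            nll -= _log_softmax(row)[target]
            count += 1
        if count == 0:
            return  # nothing but padding in this sample
        if self.accumulate:
            self._total_nll += nll
            self._total_tokens += count
        else:
            self._sample_ppls.append(math.exp(nll / count))

    def compute(self) -> float:
        """Return one perplexity over all tokens, or the mean of per-sample values."""
        if self.accumulate:
            if self._total_tokens == 0:
                raise ValueError("no tokens were added")
            return math.exp(self._total_nll / self._total_tokens)
        if not self._sample_ppls:
            raise ValueError("no samples were added")
        return sum(self._sample_ppls) / len(self._sample_ppls)
```

Handling one sample per call is what keeps the mask logic simple: each sample carries its own mask, so a batch with non-uniform attention masks needs no special treatment. In this sketch, accumulating suits concatenated datasets such as wikitext, while per-sample values suit datasets of independent prompts.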

Testing plan:

Verified ppl for base Codegen 350M mono:

  • deepsparse.transformers.eval_downstream --batch-size 16 --max-sequence-length 1024 -d openai_humaneval
    Result: mean ppl: 3.60 (PyTorch: 3.60)

Verified ppl for OPT base 1.3b:

  • deepsparse.transformers.eval_downstream --batch-size 16 --max-sequence-length 2048 -d wikitext
    Result: mean ppl: 14.62 (PyTorch: 14.63)

NOTE: This pipeline was only tested for non-cached models. It should work with kv-cache models as well. Right now the pipeline is created with sequence_length=args.max_sequence_length and prompt_processing_sequence_length=args.max_sequence_length. As soon as the kv-cache issues around this case are resolved we should test ppl evaluation again.

Update: Added support for the c4 dataset in a way that complies with the subsets defined in both SparseGPT and LLM-foundry. Validated on cached models as well.
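
For concatenated datasets such as wikitext and c4, the splitting described in point 3 can be sketched as below. This is an illustration only: `split_into_fixed_length` and its `drop_remainder` flag are hypothetical names, and dropping the short tail chunk is an assumption rather than confirmed behavior of this PR.

```python
from typing import List, Sequence


def split_into_fixed_length(
    token_ids: Sequence[int],
    sequence_length: int,
    drop_remainder: bool = True,
) -> List[List[int]]:
    """Cut one long token stream into consecutive chunks of sequence_length tokens."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    chunks = [
        list(token_ids[start : start + sequence_length])
        for start in range(0, len(token_ids), sequence_length)
    ]
    # Assumption: a trailing chunk shorter than sequence_length is discarded
    # so that every sample has exactly the same number of tokens.
    if drop_remainder and chunks and len(chunks[-1]) < sequence_length:
        chunks.pop()
    return chunks
```

With `--max-sequence-length 2048`, each chunk would hold 2048 tokens, so every sample contributes the same number of tokens to the accumulated likelihood.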

bfineran
bfineran previously approved these changes Aug 23, 2023
src/deepsparse/transformers/eval_downstream.py (outdated, resolved)
bfineran
bfineran previously approved these changes Oct 23, 2023
Contributor

@dsikka dsikka left a comment

looks a lot better. still need to verify testing cases. Could you point to where those files are? Don't seem to be a part of this PR.

bfineran
bfineran previously approved these changes Nov 1, 2023
dbogunowicz
dbogunowicz previously approved these changes Nov 7, 2023
@anmarques anmarques dismissed stale reviews from dbogunowicz and bfineran via 21c6f0d November 8, 2023 15:26
@anmarques anmarques requested a review from dsikka November 8, 2023 22:31
@anmarques anmarques merged commit 86490b0 into main Nov 10, 2023
13 checks passed
@anmarques anmarques deleted the research/ppl_refactor branch November 10, 2023 15:45
dbogunowicz added a commit that referenced this pull request Nov 13, 2023
* Add input_tokens as optional output

* Refactor Perplexity class to only compute perplexity. All other task-specific processing is handled elsewhere

* Simplify perplexity evaluation. Evaluation takes place as batch size 1 only, so no need to consider batched execution. In addition, use input_tokens from generation pipeline

* Splits wikitext at regular intervals of the same length as the sequence length

* Add argument for accumulation of negative log likelihood

* Accumulate likelihood for wikitext

* Simplification

* Add support for wikitext-style ppl evaluation

* Compute batch instead of storing until compute method. This drastically reduced memory requirements

* Remove torch dependency

* Move split of dataset into helper function

* Quality fixes

* Remove debugging prints

* Remove debugging prints

* Incorporate fixes for kv-cache

* Include doc string for accumulate

* Add support to trust-remote-code arguments

* Add support to c4

* add a missing include_prompt_logits param

* Remove unnecessary capping at sequence length (it's incorrect for cached models)

* Simplify processing for concatenated datasets

* Fix kv cache update

* Fix kv cache update

* Quality fixes

* remove batch size from pipeline instantiation

* Rename to wikitext2

* Remove trust_remote_code argument

* Remove use_deepsparse_cache argument

* Change padding of output to left in order to match padding of input ids and attention mask

* Allow trust_remote_code to be passed as argument (in some cases tokenizer can be defined by custom code)

* Move process_concatenated_datasets to helpers file

* Added support for max_text_length to speed up processing of long datasets

* Rebase w/ main

* Rebase w/ main

* Fix typo

* Rebase

* Use max_length instead of max_new_tokens

* Rebase

* Added typing and docstring

* Added typing and docstring

* Define concatenated datasets

* Add warning about batch-size not being a supported argument for some datasets

* Add unit test for pipeline and generation in ppl eval

* Add lifecycle in docstring

* Add copyright

* Style fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Quality fixes

* Rebase

* Rebase

* Re-add unit test

* Style fix

* Update unit test

* Update unit test

---------

Co-authored-by: dbogunowicz <[email protected]>
Co-authored-by: Damian <[email protected]>
Co-authored-by: Benjamin Fineran <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
5 participants