This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

[ CI ] LM Eval Testing Expansion #326

Merged · 51 commits · Jun 26, 2024
19f06cf
configs for expanded lm-eval testing
robertgshaw2-redhat Jun 22, 2024
02d9647
updated configs
robertgshaw2-redhat Jun 22, 2024
b848d3c
added many configs
robertgshaw2-redhat Jun 22, 2024
a5cac54
stash
robertgshaw2-redhat Jun 23, 2024
a260f8a
updated
robertgshaw2-redhat Jun 23, 2024
cc70508
nit on large models
robertgshaw2-redhat Jun 23, 2024
f7e1aca
cleanup configs
robertgshaw2-redhat Jun 23, 2024
3196d6c
rmove changes to utils.py
robertgshaw2-redhat Jun 23, 2024
906518f
lint
robertgshaw2-redhat Jun 23, 2024
6b03af6
cleanup utils.py
robertgshaw2-redhat Jun 23, 2024
bea9e60
remove comment
robertgshaw2-redhat Jun 23, 2024
8fdca19
added skipped files
robertgshaw2-redhat Jun 23, 2024
08ff3a3
update actions
robertgshaw2-redhat Jun 23, 2024
a8727d0
re added
robertgshaw2-redhat Jun 23, 2024
b0edd0a
fix typo in action
robertgshaw2-redhat Jun 23, 2024
115c588
nit
robertgshaw2-redhat Jun 23, 2024
d9c804e
nit
robertgshaw2-redhat Jun 23, 2024
441d718
removed utils.py changes
robertgshaw2-redhat Jun 23, 2024
e537aef
fix workflow
robertgshaw2-redhat Jun 23, 2024
fcfbd5e
config
robertgshaw2-redhat Jun 23, 2024
999e056
fix workflow hopefully
robertgshaw2-redhat Jun 23, 2024
1fa67e3
fixed lm-eval-workflow
robertgshaw2-redhat Jun 23, 2024
e788687
one more time...
robertgshaw2-redhat Jun 23, 2024
c7471de
added vllm baselining script
robertgshaw2-redhat Jun 23, 2024
19163d6
last multi typo
robertgshaw2-redhat Jun 23, 2024
df3e138
pass the correct config file
robertgshaw2-redhat Jun 23, 2024
48395f5
Merge branch 'main' into expand-lm-eval-testing
robertgshaw2-redhat Jun 24, 2024
5ffd63d
Update nm-run-lm-eval-vllm.sh
robertgshaw2-redhat Jun 24, 2024
9d21016
Merge branch 'main' into expand-lm-eval-testing
robertgshaw2-redhat Jun 25, 2024
d701fd2
convert lm-eval test script to avoid for loop
robertgshaw2-redhat Jun 25, 2024
4bcaac3
stash
robertgshaw2-redhat Jun 25, 2024
f5fc48c
removed multi gpu tests
robertgshaw2-redhat Jun 25, 2024
0e19bb5
nit
robertgshaw2-redhat Jun 25, 2024
a499686
clean up lm-eval labels
robertgshaw2-redhat Jun 25, 2024
b173468
spurious change
robertgshaw2-redhat Jun 25, 2024
877990e
fix types
robertgshaw2-redhat Jun 25, 2024
531d1c3
fix workflow
robertgshaw2-redhat Jun 25, 2024
04a06ad
removed phi from small models, it is 28GB
robertgshaw2-redhat Jun 25, 2024
86513a9
format
robertgshaw2-redhat Jun 25, 2024
cc24664
bump up timeout
robertgshaw2-redhat Jun 25, 2024
a8f701a
comment
robertgshaw2-redhat Jun 25, 2024
811d3a6
format
robertgshaw2-redhat Jun 25, 2024
d1844db
Update nm-nightly.yml
robertgshaw2-redhat Jun 25, 2024
334de0e
Update smoke-small-models.txt
robertgshaw2-redhat Jun 25, 2024
cfb5af6
Merge branch 'main' into expand-lm-eval-testing
robertgshaw2-redhat Jun 26, 2024
085e39c
Update build.sh
robertgshaw2-redhat Jun 26, 2024
7cdc163
Update format.sh
robertgshaw2-redhat Jun 26, 2024
adabde7
Update format.sh
robertgshaw2-redhat Jun 26, 2024
ba59010
Update loader.py
robertgshaw2-redhat Jun 26, 2024
ff0ea23
Merge branch 'main' into expand-lm-eval-testing
robertgshaw2-redhat Jun 26, 2024
95eb999
format
robertgshaw2-redhat Jun 26, 2024
9 changes: 9 additions & 0 deletions .github/lm-eval-configs/full-large-models.txt
@@ -0,0 +1,9 @@
Meta-Llama-3-70B-Instruct-FP8.yaml
Member:
Not sure where to put this, but it might be good to have a brief README in this repo with a sketch of the hardware requirements for these models and a brief description of the various items in the YAML. As an example of the latter, what does num_fewshot mean?

Collaborator (Author):
I'll add this in the follow-up.

Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x22B-Instruct-v0.1-FP8.yaml
Mixtral-8x22B-Instruct-v0.1.yaml
Mixtral-8x7B-Instruct-v0.1-FP8.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14B-Instruct.yaml
Qwen2-72B-Instruct.yaml
Phi-3-medium-4k-instruct.yaml
7 changes: 7 additions & 0 deletions .github/lm-eval-configs/full-small-models.txt
@@ -0,0 +1,7 @@
gemma-7b-it.yaml
Meta-Llama-3-8B-Instruct-FP8-KV.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
Meta-Llama-3-8B-Instruct-W4A16.yaml
Meta-Llama-3-8B-Instruct.yaml
Mistral-7B-Instruct-v0.3.yaml
Qwen2-7B-Instruct.yaml
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-70B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.900
- name: "exact_match,flexible-extract"
value: 0.900
limit: 250
num_fewshot: 5
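The referenced `tests/accuracy/test_lm_eval_correctness.py` is not part of this diff, but the config fields map directly onto lm-eval-harness options: `limit` is the number of GSM8k samples evaluated, `num_fewshot` is the number of in-context examples prepended to each prompt, and each metric `value` is a precomputed baseline score. A rough, hypothetical sketch of how such a config could be checked (the structure, function names, and the 5% tolerance are assumptions, not taken from the PR):

```python
# Hypothetical sketch of a baseline-comparison check; the real
# test_lm_eval_correctness.py is not shown in this diff, and the
# 5% relative tolerance below is an assumption.

# Parsed form of Meta-Llama-3-70B-Instruct-FP8.yaml (e.g. via yaml.safe_load).
config = {
    "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tasks": [
        {
            "name": "gsm8k",
            "metrics": [
                {"name": "exact_match,strict-match", "value": 0.900},
                {"name": "exact_match,flexible-extract", "value": 0.900},
            ],
        }
    ],
    "limit": 250,       # number of GSM8k samples to evaluate
    "num_fewshot": 5,   # in-context examples prepended to each prompt
}

RTOL = 0.05  # assumed relative tolerance around the recorded baseline


def within_baseline(measured: float, baseline: float, rtol: float = RTOL) -> bool:
    """True if the measured score is no more than rtol below the baseline."""
    return measured >= baseline * (1 - rtol)


# Example: a measured strict-match score of 0.88 passes against the 0.900 baseline.
baseline = config["tasks"][0]["metrics"][0]["value"]
print(within_baseline(0.88, baseline))  # → True
```

A one-sided check like this tolerates small regressions but still flags real accuracy drops; an exact-equality check would be too brittle given run-to-run sampling noise.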
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.888
- name: "exact_match,flexible-extract"
value: 0.888
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-8B-Instruct-FP8-KV.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-8B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.744
- name: "exact_match,flexible-extract"
value: 0.740
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-8B-Instruct-W4A16.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m TechxGenus/Meta-Llama-3-8B-Instruct-GPTQ -b 32 -l 250 -f 5
model_name: "TechxGenus/Meta-Llama-3-8B-Instruct-GPTQ"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.684
- name: "exact_match,flexible-extract"
value: 0.688
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.74
- name: "exact_match,flexible-extract"
value: 0.74
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Mistral-7B-Instruct-v0.3.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m mistralai/Mistral-7B-Instruct-v0.3 -b 32 -l 250 -f 5
model_name: "mistralai/Mistral-7B-Instruct-v0.3"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.524
- name: "exact_match,flexible-extract"
value: 0.524
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Mixtral-8x22B-Instruct-v0.1-FP8.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m mistralai/Mixtral-8x22B-Instruct-v0.1 -b 32 -l 250 -f 5
model_name: "mistralai/Mixtral-8x22B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.840
- name: "exact_match,flexible-extract"
value: 0.844
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Mixtral-8x22B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m mistralai/Mixtral-8x22B-Instruct-v0.1 -b 32 -l 250 -f 5
model_name: "mistralai/Mixtral-8x22B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.876
- name: "exact_match,flexible-extract"
value: 0.880
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Mixtral-8x7B-Instruct-v0.1-FP8.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m mistralai/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.616
- name: "exact_match,flexible-extract"
value: 0.620
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash ./nm-run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.616
- name: "exact_match,flexible-extract"
value: 0.628
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Phi-3-medium-4k-instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m microsoft/Phi-3-medium-4k-instruct -b 16 -l 250 -f 5
model_name: "microsoft/Phi-3-medium-4k-instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.840
- name: "exact_match,flexible-extract"
value: 0.852
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Qwen2-57B-A14B-Instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b 32 -l 250 -f 5
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.736
- name: "exact_match,flexible-extract"
value: 0.800
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Qwen2-72B-Instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m Qwen/Qwen2-72B-Instruct -b 16 -l 250 -f 5
model_name: "Qwen/Qwen2-72B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.828
- name: "exact_match,flexible-extract"
value: 0.856
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/Qwen2-7B-Instruct.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m Qwen/Qwen2-7B-Instruct -b 32 -l 250 -f 5
model_name: "Qwen/Qwen2-7B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.680
- name: "exact_match,flexible-extract"
value: 0.756
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .github/lm-eval-configs/models/gemma-7b-it.yaml
@@ -0,0 +1,11 @@
# ./nm-run-lm-eval-gsm-hf-baseline.sh -m google/gemma-7b-it -b 16 -l 250 -f 5
model_name: "google/gemma-7b-it"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.284
- name: "exact_match,flexible-extract"
value: 0.324
limit: 250
num_fewshot: 5
2 changes: 2 additions & 0 deletions .github/lm-eval-configs/smoke-large-models.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
1 change: 1 addition & 0 deletions .github/lm-eval-configs/smoke-small-models.txt
@@ -0,0 +1 @@
Meta-Llama-3-8B-Instruct.yaml
10 changes: 3 additions & 7 deletions .github/scripts/nm-run-lm-eval-gsm-hf-baseline.sh
@@ -14,23 +14,19 @@ usage() {
echo
echo " -m - huggingface stub or local directory of the model"
echo " -b - batch size to run the evaluation at"
echo " -d - device to use (e.g. cuda, cuda:0, auto, cpu)"
Collaborator (Author):
Turns out this doesn't work; you need to pass `parallelize=True` to model_args to use accelerate 🤦

Member:
nice find
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo
}

while getopts "m:b:d:l:f:" OPT; do
while getopts "m:b:l:f:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
d )
DEVICE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
@@ -45,6 +41,6 @@ while getopts "m:b:d:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE --device $DEVICE
--batch_size $BATCH_SIZE
51 changes: 51 additions & 0 deletions .github/scripts/nm-run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
echo
echo "Runs lm eval harness on GSM8k using vllm."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow"
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -b - batch size to run the evaluation at"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:b:l:f:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
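Both baseline scripts parse their flags with the same `getopts` loop. A minimal, self-contained demo of that parsing pattern (the `parse` wrapper function is illustrative only, not part of the PR):

```shell
#!/bin/bash
# Minimal demo of the getopts pattern used by the baseline scripts above.
# Wrapping the loop in a function keeps OPTIND local so it can be reused.
parse() {
  local OPTIND=1 MODEL="" TP_SIZE=""
  while getopts "m:t:" OPT; do
    case ${OPT} in
      m ) MODEL="$OPTARG" ;;
      t ) TP_SIZE="$OPTARG" ;;
    esac
  done
  echo "model=$MODEL tp=$TP_SIZE"
}

parse -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -t 1
# prints: model=neuralmagic/Meta-Llama-3-8B-Instruct-FP8 tp=1
```

In the real scripts each parsed value is then forwarded into a single `lm_eval` invocation, e.g. `tensor_parallel_size=$TP_SIZE` inside `--model_args`.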
38 changes: 34 additions & 4 deletions .github/scripts/nm-run-lm-eval-vllm.sh
@@ -7,15 +7,19 @@
usage() {
echo
echo "Runs lm eval harness on GSM8k using vllm server and compares to "
echo "precomputed baseline (measured by HF transformers."
echo "precomputed baseline (measured by HF transformers.)"
echo
echo "This script should be run from the /nm-vllm directory"
echo
echo "usage: ${0} <options>"
echo
echo " -c - path to the test data config (e.g. neuralmagic/lm-eval/YOUR_CONFIG.yaml)"
echo " -c - path to the test data config (e.g. .github/lm-eval-configs/small-models-smoke.txt)"
echo
}

while getopts "c:" OPT; do
SUCCESS=0

while getopts "c:t:" OPT; do
case ${OPT} in
c )
CONFIG="$OPTARG"
@@ -27,4 +31,30 @@ while getopts "c:" OPT; do
esac
done

LM_EVAL_TEST_DATA_FILE=$CONFIG pytest -v tests/accuracy/test_lm_eval_correctness.py
# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
LOCAL_SUCCESS=0

echo "=== RUNNING MODEL: $MODEL_CONFIG ==="

MODEL_CONFIG_PATH=$PWD/.github/lm-eval-configs/models/${MODEL_CONFIG}
LM_EVAL_TEST_DATA_FILE=$MODEL_CONFIG_PATH pytest -s tests/accuracy/test_lm_eval_correctness.py || LOCAL_SUCCESS=$?

if [[ $LOCAL_SUCCESS == 0 ]]; then
echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
else
echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
fi

SUCCESS=$((SUCCESS + LOCAL_SUCCESS))

done

if [ "${SUCCESS}" -eq "0" ]; then
exit 0
else
exit 1
fi
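The loop above folds per-model pytest failures into a single exit code so one failing model does not stop the remaining models from running. The same pattern in isolation, with `true`/`false` standing in for the pytest invocation:

```shell
#!/bin/bash
# Demo of the accumulate-and-exit pattern used by nm-run-lm-eval-vllm.sh.
# `true` and `false` stand in for the per-model pytest run.
SUCCESS=0
for CMD in true false true; do
  LOCAL_SUCCESS=0
  $CMD || LOCAL_SUCCESS=$?   # capture the failure without aborting the loop
  SUCCESS=$((SUCCESS + LOCAL_SUCCESS))
done
echo "accumulated failure total: $SUCCESS"
# Any nonzero total means at least one model failed, so exit nonzero overall.
[ "$SUCCESS" -eq 0 ] && echo PASS || echo FAIL
```

The `|| LOCAL_SUCCESS=$?` idiom matters because a bare failing command would otherwise terminate the script under `set -e`, skipping the models that follow.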
2 changes: 1 addition & 1 deletion .github/workflows/nm-build-test.yml
@@ -85,7 +85,7 @@ on:
type: string
default: "60"
lm_eval_configuration:
description: "configuration for lm-eval test (see neuralmagic/lm-eval)"
description: "configuration for lm-eval test (see .github/lm-eval-configs)"
type: string
default: ""

2 changes: 1 addition & 1 deletion .github/workflows/nm-nightly.yml
@@ -45,6 +45,6 @@ jobs:
push_benchmark_results_to_gh_pages: "${{ github.event_name == 'schedule' || inputs.push_benchmark_results_to_gh_pages }}"

lm_eval_label: gcp-k8s-l4-solo
lm_eval_configuration: ./neuralmagic/lm-eval/full-small-models.yaml
lm_eval_configuration: ./.github/lm-eval-configs/smoke-small-models.txt
lm_eval_timeout: 60
secrets: inherit
2 changes: 1 addition & 1 deletion .github/workflows/nm-remote-push.yml
@@ -30,6 +30,6 @@ jobs:
benchmark_timeout: 480

lm_eval_label: gcp-k8s-l4-solo
lm_eval_configuration: ./neuralmagic/lm-eval/smoke-small-models.yaml
lm_eval_configuration: ./.github/lm-eval-configs/smoke-small-models.txt
lm_eval_timeout: 60
secrets: inherit
11 changes: 0 additions & 11 deletions neuralmagic/lm-eval/full-small-models.yaml

This file was deleted.
