[Bugfix] update should_ignore_layer #11354
@robertgshaw2-neuralmagic
Signed-off-by: George Ohashi <[email protected]>
Could you share an example of a config where it fails before this change? I don't understand why we want to short-circuit the logic in this function even if some layers are ignored, because some shards might be ignored while others aren't. Basically, I want to make sure we will still hit this error message:

```python
# If shard_idx=1+ confirm scheme matches prior shards.
elif should_ignore_shard != should_ignore_layer:
    raise ValueError(f"Found a different quantization schemes for "
                     f"{shard_proj_names} in {layer_name}. vLLM "
                     "requires all to use the same scheme.")
```
Sure, this is an example using llmcompressor:

```python
from datasets import load_dataset
from loguru import logger
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Quantize only the KV cache (FP8); lm_head is ignored.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            kv_cache_scheme:
                {num_bits: 8, type: float, symmetric: true, strategy: tensor}
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# logger.info(
#     "Running sample generation. ",
#     "Note: Inference with the quantized kv_cache is not supported. ",
#     "Please use vLLM for inference with the quantized kv_cache.",
# )
# # Confirm generations of the quantized model look sane.
# print("\n\n")
# print("========== SAMPLE GENERATION ==============")
# input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
# output = model.generate(input_ids, max_new_tokens=100)
# print(tokenizer.decode(output[0]))
# print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV-only"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

# Load the saved checkpoint in vLLM and generate.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
llm = LLM(model=SAVE_DIR)
outputs = llm.generate("Hello my name is", sampling_params)
print(outputs)
```
Simple script:

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.80, top_p=0.95)

# https://huggingface.co/horheynm/Phi-3-mini-4k-instruct-kv_cache/blob/main/config.json
path = "horheynm/Phi-3-mini-4k-instruct-kv_cache"
llm = LLM(model=path)
outputs = llm.generate("Hello my name is", sampling_params)
print(outputs[0].outputs[0].text)
```
I see, so this works around an issue unrelated to the quantization of fused layers. LGTM then, thanks!
~~Contingent on merge of vllm-project/vllm#11354~~ ^ merged

SUMMARY: Add kv-cache e2e testing
* One small model - tinyllama - with kv-cache
* One small model - tinyllama - with kv-cache + gptq
* Fused Model - phi3 - with kv-cache

Signed-off-by: Kyle Sayers <[email protected]>
FIX ignore logic in Compressed Tensors utils.
Previously, for fused layers, `should_ignore_layer` always applied the shard_proj_names logic without consulting the ignore list passed in as an argument.
The fix takes the ignore list into account for fused layers.
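To make the described behavior concrete, here is a minimal, simplified sketch of the idea. It is not the actual vLLM implementation: the function name `should_ignore_layer_sketch`, the `FUSED_MAPPING` table, and the exact-match lookup against the ignore list are all assumptions made for illustration. It checks the ignore list for the fused layer name first, and only otherwise expands the fused layer into its shards and requires that they all agree.

```python
from typing import Iterable, Mapping, Optional, Sequence

# Hypothetical mapping from a fused projection name to its unfused shards.
FUSED_MAPPING: Mapping[str, Sequence[str]] = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def should_ignore_layer_sketch(
    layer_name: Optional[str],
    ignore: Iterable[str],
    fused_mapping: Mapping[str, Sequence[str]] = FUSED_MAPPING,
) -> bool:
    """Illustrative only: decide whether a layer is excluded from quantization."""
    if layer_name is None:
        return False
    ignore_set = set(ignore)  # assumption: exact-name matching only

    # The fix, as described: honor the ignore list for the fused name itself
    # (e.g. a checkpoint such as Phi-3 that already stores qkv_proj fused).
    if layer_name in ignore_set:
        return True

    proj_name = layer_name.split(".")[-1]
    if proj_name in fused_mapping:
        prefix = layer_name[: -len(proj_name)]
        shard_names = [prefix + shard for shard in fused_mapping[proj_name]]
        decisions = {name in ignore_set for name in shard_names}
        # All shards of a fused layer must share the same decision.
        if len(decisions) > 1:
            raise ValueError(
                f"Found different quantization schemes for {shard_names} "
                f"in {layer_name}. All shards must use the same scheme."
            )
        return decisions.pop()

    return False
```

In this sketch a fused name listed directly in `ignore` returns early, while a partially ignored set of shards still raises the `ValueError` the reviewer asked about.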
Tested with this previously failing model: