4-bit quant? #28
Comments
Hey @nmandic78! I have applied 8-bit quantization to deepseek-vl-1.3b-chat, which is the smallest model. I have a dummy repository where I throw all of my experiments. Unfortunately, I haven't been able to push the model to the Hub because serialization of the weights is not fully supported, but I think you can do it in a different way. Here is a link to the notebook where I do the quantization; it's quite short.
@jucamohedano, thank you for the info. I'll take a look at quanto. At first glance, I see they still have some issues with optimization ('latency: all models are at least 2x slower than the 16-bit models due to the lack of optimized kernels (for now).').
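For readers unfamiliar with what 8-bit weight quantization does, here is a minimal pure-Python sketch of the general idea (symmetric, per-tensor int8). It is only an illustration under that assumption: it is not the code from the notebook mentioned above, and not how quanto implements it internally. The function names are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored weight differs from the original by at most half a quantization step.
print(q, scale, restored)
```

Each weight is stored as one byte plus a shared scale, which is where the memory saving over 16-bit weights comes from.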
@nmandic78 You can edit a few things and use BitsAndBytesConfig from transformers to load it in 4-bit mode. Since my GPU is a 2080 and doesn't support bfloat16, I had to edit all the other deepseek *.py files and change every bfloat16 to float16. If you're on a 3XXX or 4XXX Nvidia card, you don't have to and can stick with bfloat16.
This works just fine for me and stays within the 8GB limit of my GPU. The results are still accurate from what I can tell (you can ignore the flash attention warning).
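A rough sketch of the approach described above, for anyone landing here later. The helper names (`pick_compute_dtype_name`, `build_4bit_kwargs`, `load_4bit`) are hypothetical, and loading the checkpoint through `AutoModelForCausalLM` with `trust_remote_code=True` is an assumption; it is not necessarily the exact code used in this thread. The dtype choice follows the point about GPU generations: a 2080 reports compute capability 7.5 (no bfloat16), while a 3090 reports 8.6.

```python
def pick_compute_dtype_name(compute_capability):
    """Return "bfloat16" for Ampere or newer GPUs (major >= 8), else "float16"."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


def build_4bit_kwargs(compute_capability, quant_type="nf4"):
    """Keyword arguments for transformers.BitsAndBytesConfig (dtype kept as a name)."""
    if quant_type not in ("nf4", "fp4"):
        raise ValueError("quant_type must be 'nf4' or 'fp4'")
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": quant_type,
        "bnb_4bit_compute_dtype": pick_compute_dtype_name(compute_capability),
    }


def load_4bit(model_path, quant_type="nf4"):
    """Load a checkpoint in 4-bit mode. Needs torch, transformers, bitsandbytes and a CUDA GPU."""
    # Heavy imports are kept local so the helpers above work without them.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = build_4bit_kwargs(torch.cuda.get_device_capability(), quant_type)
    # Turn the dtype name into the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    # Assumption: the checkpoint can be loaded this way with remote code enabled.
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(**kwargs),
        trust_remote_code=True,
    )
```

Calling `load_4bit(model_path, quant_type="fp4")` versus `"nf4"` is one way to compare the two quantization types on the same inputs. Note that this only sets the compute dtype for the quantized layers; any bfloat16 hard-coded in the model's own *.py files still has to be changed by hand on older cards.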
@RandomGitUser321, thank you!
@RandomGitUser321 just a short update, in case anyone else stumbles on this. I have a 3090, so I can use bfloat16, but I had to convert every layer with .to(torch.bfloat16) to make it work. It was tedious following the errors, but after converting them all it works just fine. My VRAM usage is higher (8.8GB, with 370MB in use before model load). Ubuntu 22.04, NVIDIA 545.23. One more observation: I tried both FP4 and NF4 for bnb_4bit_quant_type, and for the same inputs (image + prompt) the FP4 answer looks more detailed.
Hi! Thank you for releasing this multimodal model. The first tests are impressive. Even the 1.3B is good for its size.
It's just that the 7b version in full precision is still taxing on the personal hardware we have at home.
Would it be possible to quantize it to int4, like Qwen did with their Qwen-VL-Chat-Int4?
I think it would be best if you could do it and put it in your HF repo so the community can use it.
If not, maybe you could give us some guidelines on how to do it.
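To put rough numbers on the request, here is a back-of-the-envelope estimate of weight memory at different precisions. It assumes nominal parameter counts (7B and 1.3B) and counts only the weights; activations, the KV cache, CUDA context, and any layers left unquantized are ignored, so real VRAM usage will be higher. The function name is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in [("1.3B", 1.3e9), ("7B", 7e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: about {weight_memory_gb(params, bits):.2f} GB")
```

Under these assumptions the 7B weights shrink from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is why a 4-bit variant is attractive for 8GB consumer cards.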