Resolved ALIBI bias regression due to porting flat PA #503
base: habana_main
Conversation
force-pushed from 4a0674d to 3959126
force-pushed from b339767 to 3c3e18a
force-pushed from 3c3e18a to 6c19183
force-pushed from 6c19183 to 3cb455d
@itaraban @madamczykhabana @kzawora-intel Has anyone gotten a chance to review this PR and the associated one on vllm-hpu-extension? I just pushed out a significant update that minimizes changes to non-ALiBi code sections. It also includes significant accuracy and memory optimization changes. With the current changes, ALiBi is now fully functional as long as FW >= 1.19.0 is being used. Please help review. Any feedback would be appreciated.
force-pushed from 49fcaaa to 64822b0
force-pushed from 64822b0 to 684384e
force-pushed from 214885e to d3fa482
@michalkuligowski I have fixed the static code analysis issue and updated requirements-hpu.txt.
@tannervoas742 there are still some issues detected, please check (you can try running the format.sh script):
Error: vllm/attention/layer.py:99: error: Too many arguments for "AttentionImpl" [call-arg]
Error: vllm/attention/backends/hpu_attn.py:279: error: Value of type "Optional[Any]" is not indexable [index]
Error: vllm/attention/backends/hpu_attn.py:291: error: Item "None" of "Optional[Any]" has no attribute "unsqueeze" [union-attr]
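(Editor's note: the [index] and [union-attr] errors above are mypy's standard complaints about indexing or calling methods on an Optional value before narrowing it. Below is a minimal sketch of the usual fix pattern, using a hypothetical `alibi_blocks` argument rather than the actual vllm code.)

```python
# Minimal sketch of the usual fix for the [index] / [union-attr] errors above.
# `alibi_blocks` is a hypothetical Optional argument, not the actual vllm code.
from typing import Optional

import torch


def gather_alibi_bias(alibi_blocks: Optional[torch.Tensor],
                      idx: int) -> torch.Tensor:
    # Narrow the Optional before indexing or calling .unsqueeze();
    # without this check mypy reports [index] and [union-attr].
    if alibi_blocks is None:
        raise ValueError("ALiBi biases were not initialized")
    block = alibi_blocks[idx]   # OK: alibi_blocks is now a plain Tensor
    return block.unsqueeze(0)   # OK: no longer Optional
```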
@michalkuligowski I see the issues now. I wasn't sure where to view the static code analysis report, but I found it. I pushed out an update and am waiting for the code analysis to run again. I will reply here when it's finished and ready for re-review.
force-pushed from 9fac2b5 to ba971fd
@itaraban @michalkuligowski I have updated the PR and run tools/mypy.sh, which passes locally. I also tested the updated version with various ALiBi and non-ALiBi models. Please re-review. I opened the extension PR again as well: HabanaAI/vllm-hpu-extension#60
force-pushed from ba971fd to b937caf
The biggest issue I have right now is that modifying any file that isn't HPU-specific (models, attention backends) will make this hard or impossible to upstream. I didn't want to repeat the comment for each file, but I think the changes should be removed from all of them.
force-pushed from b937caf to 143f7c6
@kwisniewski98 I refined this PR so that only HPU files are changed. I have also rebased this and the extension PR (HabanaAI/vllm-hpu-extension#60) off the latest main branches. Tested with several ALiBi and non-ALiBi models, and local mypy runs showed no new errors. Please help re-review.
Just one last small comment. We will probably merge HabanaAI/vllm-hpu-extension#70 tomorrow; after that you will have to change the sha of vllm-hpu-extension in requirements-hpu.txt.
force-pushed from 143f7c6 to 787d66c
Understood. I fixed the small issue you mentioned. Will update this PR with the extension sha after that has merged.
Please fix the conflicts and static code analysis issues.
force-pushed from 787d66c to 1c63b12
force-pushed from 1c63b12 to 2d7b0a3
@michalkuligowski @kwisniewski98 Conflicts have been resolved. Yapf and ruff issues should also be resolved now.
Changes:
- Added back alibi biases to decode stage.
- Optimized ALiBI memory usage.
- Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow large models to run with restricted prompt lengths.
- Prompt biases instantiated once rather than each forward.
- Prompt and decode biases are shared across encoder/decoder layers.
- Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve accuracy issue on long sequences.
- Works in lazy and eager mode.
- ALiBI is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and "VLLM_CONTIGUOUS_PA=true".
- NTT patch for GQA

Co-authored-by: Tanner Voas <[email protected]>
Co-authored-by: Haihao Xiang <[email protected]>
Signed-off-by: Tanner Voas <[email protected]>
force-pushed from 2d7b0a3 to ec99176
Requires associated changes on vllm-hpu-extension PR.

Changes:
- Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow large models to run with restricted prompt lengths.
- Prompt biases instantiated once rather than each forward.
- Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve accuracy issue on long sequences.
- … Its changes are the simplest though.
- ALiBI is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and "VLLM_CONTIGUOUS_PA=true".
- … varying length.
- … limitation in softmax. Resolved on FW >= 1.19.0.

Co-authored-by: Tanner Voas <[email protected]>
Co-authored-by: Haihao Xiang <[email protected]>
Signed-off-by: Tanner Voas <[email protected]>
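(Editor's note: for reference, a minimal sketch, not taken from the PR, of how the environment variables listed above might be combined when running an ALiBi model with vLLM on Gaudi. Only the variable names come from the change list; the model name, prompt, and numeric values are placeholders.)

```python
# Hypothetical launch script: only the environment variable names come from
# this PR's change list; the model, prompt, and values are placeholders.
import os

os.environ["VLLM_PROMPT_USE_FUSEDSDPA"] = "false"     # required for ALiBi per the change list
os.environ["VLLM_CONTIGUOUS_PA"] = "true"             # required for ALiBi per the change list
os.environ["VLLM_PROMPT_ALIBI_MAX_SEQ_LEN"] = "4096"  # example cap on prompt bias length
os.environ["VLLM_ALIBI_USE_FLOAT32_BIASES"] = "true"  # fp32 biases for long-sequence accuracy

from vllm import LLM, SamplingParams  # noqa: E402 (env vars must be set before import)

llm = LLM(model="baichuan-inc/Baichuan2-13B-Chat")    # placeholder ALiBi-based model
out = llm.generate(["ALiBi smoke test"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```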