Resolved ALIBI bias regression due to porting flat PA #503

Open
wants to merge 1 commit into
base: habana_main
from

@tannervoas742 tannervoas742 commented Nov 15, 2024

Requires associated changes on vllm-hpu-extension PR

Changes:

  • Added back ALiBi biases to the decode stage.
  • Optimized ALiBi memory usage.
    • Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow
      large models to run with restricted prompt lengths.
    • Prompt biases are instantiated once in init rather than on each
      forward.
    • Prompt and decode biases are shared across encoder/decoder layers.
  • Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve
    an accuracy issue on long sequences.
  • Updated jais, mpt, falcon, baichuan, and bloom to work with ALiBI.
    • Due to bloom's 176B parameter size I was unable to test this model.
      Its changes are the simplest though.
  • Works in lazy and eager mode.
  • ALiBi is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and
    "VLLM_CONTIGUOUS_PA=true".
  • Added position offsets to improve quality on BS > 1 with sequences of
    varying length.
  • BS > 1 may have accuracy issues on FW < 1.19.0 due to a limitation
    in softmax. This is resolved on FW >= 1.19.0.
  • NTT patch for GQA

Co-authored-by: Tanner Voas [email protected]
Co-authored-by: Haihao Xiang [email protected]
Signed-off-by: Tanner Voas [email protected]
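
For readers unfamiliar with the biases this PR restores, here is a minimal pure-Python sketch. It is not the PR's HPU implementation. The slope formula follows the reference code published with the ALiBi paper; `MAX_SEQ_LEN`, `BIAS_TABLE`, `prompt_bias`, and `round_to_bfloat16` are hypothetical names used only for illustration. The bfloat16 emulation assumes the long-sequence accuracy issue is related to a 16-bit bias dtype, which this thread does not confirm.

```python
import math
import struct


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes, following the ALiBi paper's reference code."""

    def power_of_two_slopes(n: int) -> list[float]:
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if num_heads & (num_heads - 1) == 0:  # head count is a power of two
        return power_of_two_slopes(num_heads)
    closest = 2 ** math.floor(math.log2(num_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: num_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(num_heads: int, seq_len: int) -> list[list[list[float]]]:
    """bias[h][i][j] = slope_h * (j - i); the causal mask is applied separately."""
    return [
        [[slope * (j - i) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


def round_to_bfloat16(x: float) -> float:
    """Emulate bfloat16 storage (round to nearest even) via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical cap: build the table once, then slice it per request
# instead of rebuilding it on every forward pass.
MAX_SEQ_LEN = 8
BIAS_TABLE = alibi_bias(num_heads=8, seq_len=MAX_SEQ_LEN)


def prompt_bias(seq_len: int) -> list[list[list[float]]]:
    """Slice the precomputed table down to the current prompt length."""
    return [[row[:seq_len] for row in head[:seq_len]] for head in BIAS_TABLE]


# With slope 0.5, token distances 256 and 257 collapse to the same
# bias once the value is stored with bfloat16 precision.
near = round_to_bfloat16(-0.5 * 256)
far = round_to_bfloat16(-0.5 * 257)
```

Under these assumptions, `near` and `far` are equal, so two adjacent distant tokens receive the same penalty at 16-bit precision. Keeping the biases in float32 avoids that collapse at the cost of a larger table, which is one reason a length cap on the precomputed table can matter for memory.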

@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch 6 times, most recently from 4a0674d to 3959126 Compare November 18, 2024 10:33
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch 4 times, most recently from b339767 to 3c3e18a Compare November 27, 2024 03:05
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 3c3e18a to 6c19183 Compare November 28, 2024 01:22
@zhouyuan zhouyuan mentioned this pull request Nov 28, 2024
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 6c19183 to 3cb455d Compare December 5, 2024 19:23
@tannervoas742
Author

@itaraban @madamczykhabana @kzawora-intel has anyone had a chance to review this PR and the associated one on vllm-hpu-extension? I just pushed a significant update that minimizes changes to non-ALiBi code sections. It also includes significant accuracy and memory optimizations.

With the current changes ALiBi is now fully functional as long as FW >= 1.19.0 is being used.

Please help review. Any feedback would be appreciated.

@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch 3 times, most recently from 49fcaaa to 64822b0 Compare December 10, 2024 16:16
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 64822b0 to 684384e Compare December 11, 2024 15:04
vllm/worker/hpu_model_runner.py (outdated, resolved)
vllm/attention/backends/hpu_attn.py (resolved)
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch 3 times, most recently from 214885e to d3fa482 Compare December 12, 2024 20:01
@tannervoas742
Author

@michalkuligowski I have fixed the static code analysis issue and updated requirements-hpu.txt.

@michalkuligowski michalkuligowski left a comment

@tannervoas742 there are still some issues detected, please check (you can try running the format.sh script):
Error: vllm/attention/layer.py:99: error: Too many arguments for "AttentionImpl" [call-arg]
Error: vllm/attention/backends/hpu_attn.py:279: error: Value of type "Optional[Any]" is not indexable [index]
Error: vllm/attention/backends/hpu_attn.py:291: error: Item "None" of "Optional[Any]" has no attribute "unsqueeze" [union-attr]
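
The two `Optional[Any]` errors above are the kind mypy reports when a value that may be `None` is indexed or has an attribute accessed without a guard. A generic sketch of the usual fix, narrowing the type with an explicit `None` check, is shown below. The names `bias_at`, `Bias`, and `scaled_or_empty` are hypothetical and are not taken from `hpu_attn.py`; the thread does not show how the PR actually resolved these errors.

```python
from typing import Optional, Sequence


def bias_at(biases: Optional[Sequence[float]], index: int) -> float:
    # Without this guard mypy reports:
    #   Value of type "Optional[...]" is not indexable  [index]
    if biases is None:
        return 0.0
    return biases[index]


class Bias:
    """Tiny stand-in for an object that exposes a method, like a tensor."""

    def __init__(self, values: Sequence[float]) -> None:
        self.values = list(values)

    def scaled(self, factor: float) -> list[float]:
        return [v * factor for v in self.values]


def scaled_or_empty(bias: Optional[Bias], factor: float) -> list[float]:
    # Without this guard mypy reports:
    #   Item "None" of "Optional[Bias]" has no attribute "scaled"  [union-attr]
    if bias is None:
        return []
    return bias.scaled(factor)
```

After the `is None` check, mypy treats the variable as the non-optional type for the rest of the function, so both the index and the attribute access type-check.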

@tannervoas742
Author

@michalkuligowski I see the issues now. I wasn't sure where to view the static code analysis report, but found it. I pushed out an update. Waiting for the code analysis to run again. Will reply here when it's finished and ready for re-review.

@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch 3 times, most recently from 9fac2b5 to ba971fd Compare December 16, 2024 23:37
@tannervoas742
Author

@itaraban @michalkuligowski I have updated the PR and run the script in tools/mypy.sh, which passes locally. I also tested the updated version with various ALiBi and non-ALiBi models. Please re-review. I have also reopened the extension PR: HabanaAI/vllm-hpu-extension#60

@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from ba971fd to b937caf Compare December 17, 2024 17:48
@kwisniewski98 kwisniewski98 left a comment

The biggest issue I have right now is that modifying any file that isn't HPU-specific (models, attention backends) will make this hard or impossible to upstream. I didn't want to repeat the comment for each file, but I think the changes should be removed from all of them.

vllm/attention/backends/abstract.py (outdated, resolved)
vllm/attention/backends/hpu_attn.py (outdated, resolved)
vllm/model_executor/models/baichuan.py (outdated, resolved)
vllm/model_executor/models/baichuan.py (outdated, resolved)
vllm/attention/layer.py (outdated, resolved)
vllm/worker/hpu_model_runner.py (resolved)
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from b937caf to 143f7c6 Compare January 14, 2025 03:48
@tannervoas742
Author

@kwisniewski98 I refined this PR so that only HPU files are changed. I have also rebased this PR and the extension PR (HabanaAI/vllm-hpu-extension#60) onto the latest main branches. I tested with several ALiBi and non-ALiBi models, and local mypy runs showed no new errors.

Please help re-review.

@kwisniewski98 kwisniewski98 left a comment

Just one last small comment. We will probably merge HabanaAI/vllm-hpu-extension#70 tomorrow; after that you will have to update the SHA of vllm-hpu-extension in requirements-hpu.txt.

vllm/attention/backends/hpu_attn.py (outdated, resolved)
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 143f7c6 to 787d66c Compare January 14, 2025 14:17
@tannervoas742
Author

> Just one last small comment. We will probably merge HabanaAI/vllm-hpu-extension#70 tomorrow; after that you will have to update the SHA of vllm-hpu-extension in requirements-hpu.txt.

Understood. I fixed the small issue you mentioned. Will update this PR with the extension sha after that has merged.

@michalkuligowski michalkuligowski left a comment

Please fix the conflicts and static code analysis issues.

@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 787d66c to 1c63b12 Compare January 20, 2025 13:26
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 1c63b12 to 2d7b0a3 Compare January 20, 2025 13:32
@tannervoas742
Author

@michalkuligowski @kwisniewski98 conflicts have been resolved. Yapf and ruff issues should also be resolved now.

Changes:
- Added back ALiBi biases to the decode stage.
- Optimized ALiBi memory usage.
  - Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow
    large models to run with restricted prompt lengths.
  - Prompt biases are instantiated once rather than on each forward.
  - Prompt and decode biases are shared across encoder/decoder layers.
- Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve
  an accuracy issue on long sequences.
- Works in lazy and eager mode.
- ALiBi is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and
  "VLLM_CONTIGUOUS_PA=true".
- NTT patch for GQA

Co-authored-by: Tanner Voas <[email protected]>
Co-authored-by: Haihao Xiang <[email protected]>
Signed-off-by: Tanner Voas <[email protected]>
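
The environment settings named in the commit message can be collected and sanity-checked before launching a server. The sketch below is a hypothetical helper, not part of vLLM. The two restricted values are quoted from the commit message; the value formats for `VLLM_PROMPT_ALIBI_MAX_SEQ_LEN` (an integer) and `VLLM_ALIBI_USE_FLOAT32_BIASES` (a boolean string) are assumptions, as is treating an unset variable as a problem, since the thread does not state the defaults.

```python
import os
from typing import Mapping

# Values quoted from the commit message.
REQUIRED_FOR_ALIBI = {
    "VLLM_PROMPT_USE_FUSEDSDPA": "false",
    "VLLM_CONTIGUOUS_PA": "true",
}


def check_alibi_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the settings match.

    Hypothetical helper: unset variables are reported because the
    defaults are not stated in this thread.
    """
    problems = []
    for name, expected in REQUIRED_FOR_ALIBI.items():
        actual = env.get(name)
        if actual is None or actual.strip().lower() != expected:
            problems.append(f"{name} should be {expected!r}, got {actual!r}")
    return problems


# Build the environment for a child process without mutating os.environ.
launch_env = dict(os.environ)
launch_env.update(REQUIRED_FOR_ALIBI)
launch_env["VLLM_PROMPT_ALIBI_MAX_SEQ_LEN"] = "2048"  # assumed integer format
launch_env["VLLM_ALIBI_USE_FLOAT32_BIASES"] = "true"  # assumed boolean format
```

A dictionary like `launch_env` could then be passed as the `env` argument of `subprocess.run` when starting the server.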
@tannervoas742 tannervoas742 force-pushed the restore_alibi_for_flat_pa_final branch from 2d7b0a3 to ec99176 Compare January 22, 2025 02:32
4 participants