Resolved ALIBI bias regression due to porting flat PA #503
base: habana_main
Conversation
force-pushed from 4a0674d to 3959126
force-pushed from b339767 to 3c3e18a
force-pushed from 3c3e18a to 6c19183
force-pushed from 6c19183 to 3cb455d
@itaraban @madamczykhabana @kzawora-intel Has anyone gotten a chance to review this PR and the associated one on vllm-hpu-extension? I just pushed out a significant update that minimizes changes to non-ALiBi code sections. It also includes significant accuracy and memory optimization changes. With the current changes, ALiBi is now fully functional as long as FW >= 1.19.0 is being used. Please help review. Any feedback would be appreciated.
force-pushed from 49fcaaa to 64822b0
force-pushed from 64822b0 to 684384e
force-pushed from 214885e to d3fa482
@michalkuligowski I have fixed the static code analysis issue and updated requirements-hpu.txt.
@tannervoas742 there are still some issues detected, please check (you can try running the format.sh script):
Error: vllm/attention/layer.py:99: error: Too many arguments for "AttentionImpl" [call-arg]
Error: vllm/attention/backends/hpu_attn.py:279: error: Value of type "Optional[Any]" is not indexable [index]
Error: vllm/attention/backends/hpu_attn.py:291: error: Item "None" of "Optional[Any]" has no attribute "unsqueeze" [union-attr]
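(Editor's note: the [index] and [union-attr] errors above are mypy's standard complaints about indexing or calling methods on an Optional value before narrowing it. Below is a minimal sketch of the usual fix pattern, using a hypothetical `alibi_blocks` argument rather than the actual vllm code.)

```python
# Minimal sketch of the usual fix for the [index] / [union-attr] errors above.
# `alibi_blocks` is a hypothetical Optional argument, not the actual vllm code.
from typing import Optional

import torch


def gather_alibi_bias(alibi_blocks: Optional[torch.Tensor],
                      idx: int) -> torch.Tensor:
    # Narrow the Optional before indexing or calling .unsqueeze();
    # without this check mypy reports [index] and [union-attr].
    if alibi_blocks is None:
        raise ValueError("ALiBi biases were not initialized")
    block = alibi_blocks[idx]   # OK: alibi_blocks is now a plain Tensor
    return block.unsqueeze(0)   # OK: no longer Optional
```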
@michalkuligowski I see the issues now. I wasn't sure where to view the static code analysis report, but I found it. I pushed out an update and am waiting for the code analysis to run again. I will reply here when it's finished and ready for re-review.
force-pushed from 9fac2b5 to ba971fd
@itaraban @michalkuligowski I have updated the PR and run tools/mypy.sh, which passes locally. I also tested the updated version with various ALiBi and non-ALiBi models. Please re-review. I opened the extension PR again as well: HabanaAI/vllm-hpu-extension#60
force-pushed from ba971fd to b937caf
The biggest issue I have right now is that modifying any file that isn't HPU-specific (models, attention backends) will make this hard or impossible to upstream. I didn't want to repeat the comment for each file, but I think the changes should be removed from all of them.
force-pushed from b937caf to 143f7c6
@kwisniewski98 I refined this PR so that only HPU files are changed. I have also rebased this and the extension PR (HabanaAI/vllm-hpu-extension#60) off the latest main branches. Tested with several ALiBi and non-ALiBi models, and local mypy runs showed no new errors. Please help re-review.
Just one last small comment. We will probably merge HabanaAI/vllm-hpu-extension#70 tomorrow; after that you will have to change the sha of vllm-hpu-extension in requirements-hpu.txt.
force-pushed from 143f7c6 to 787d66c
Understood. I fixed the small issue you mentioned. Will update this PR with the extension sha after that has merged.
Please fix the conflicts and static code analysis issues.
force-pushed from 787d66c to 1c63b12
force-pushed from 1c63b12 to 2d7b0a3
@michalkuligowski @kwisniewski98 Conflicts have been resolved. Yapf and ruff issues should also be resolved now.
Changes:
- Added back alibi biases to decode stage.
- Optimized ALiBI memory usage.
- Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow large models to run with restricted prompt lengths.
- Prompt biases instantiated once rather than each forward.
- Prompt and decode biases are shared across encoder/decoder layers.
- Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve accuracy issue on long sequences.
- Works in lazy and eager mode.
- ALiBI is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and "VLLM_CONTIGUOUS_PA=true".
- NTT patch for GQA

Co-authored-by: Tanner Voas <[email protected]>
Co-authored-by: Haihao Xiang <[email protected]>
Signed-off-by: Tanner Voas <[email protected]>
force-pushed from 2d7b0a3 to ec99176
Requires associated changes on vllm-hpu-extension PR.

Changes:
- Added environment variable "VLLM_PROMPT_ALIBI_MAX_SEQ_LEN" to allow large models to run with restricted prompt lengths.
- Prompt biases instantiated once rather than each forward.
- Added environment variable "VLLM_ALIBI_USE_FLOAT32_BIASES" to resolve accuracy issue on long sequences.
- … Its changes are the simplest though.
- ALiBI is restricted to "VLLM_PROMPT_USE_FUSEDSDPA=false" and "VLLM_CONTIGUOUS_PA=true".
- … varying length.
- … limitation in softmax. Resolved on FW >= 1.19.0.

Co-authored-by: Tanner Voas <[email protected]>
Co-authored-by: Haihao Xiang <[email protected]>
Signed-off-by: Tanner Voas <[email protected]>
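(Editor's note: for reference, a minimal sketch, not taken from the PR, of how the environment variables listed above might be combined when running an ALiBi model with vLLM on Gaudi. Only the variable names come from the change list; the model name, prompt, and numeric values are placeholders.)

```python
# Hypothetical launch script: only the environment variable names come from
# this PR's change list; the model, prompt, and values are placeholders.
import os

os.environ["VLLM_PROMPT_USE_FUSEDSDPA"] = "false"     # required for ALiBi per the change list
os.environ["VLLM_CONTIGUOUS_PA"] = "true"             # required for ALiBi per the change list
os.environ["VLLM_PROMPT_ALIBI_MAX_SEQ_LEN"] = "4096"  # example cap on prompt bias length
os.environ["VLLM_ALIBI_USE_FLOAT32_BIASES"] = "true"  # fp32 biases for long-sequence accuracy

from vllm import LLM, SamplingParams  # noqa: E402 (env vars must be set before import)

llm = LLM(model="baichuan-inc/Baichuan2-13B-Chat")    # placeholder ALiBi-based model
out = llm.generate(["ALiBi smoke test"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```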