update get_split_k to fix a performance regression on FA decode #1040
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #1040      +/-   ##
==========================================
- Coverage   59.64%   59.58%   -0.06%
==========================================
  Files         114      114
  Lines       10223    10232       +9
==========================================
  Hits         6097     6097
- Misses       4126     4135       +9
Flags with carried forward coverage won't be shown.
@scxiao thanks! Do you have the performance results (on both AMD and H100)?
Just confirming: this PR will not regress H100 perf?
xformers/ops/fmha/triton_splitk.py
Outdated
@@ -1028,19 +1028,31 @@ def not_supported_reasons(cls, d: Inputs) -> List[str]:
@classmethod
def get_split_k(cls, B: int, G: int, H: int, Mk: int) -> int:
    """Heuristic for the number of splits"""
    print(f"B = {B}, G = {G}, H = {H}, Mk = {Mk}")
nit: remove.
Sorry, it seems I pushed an unexpected commit to this PR. I have converted it to a draft and will let you know when the code is cleaned up.
    split_k_upper_bound = 512
else:
    max_chunk_size = 64 if Mk <= 512 and bh <= 64 else 128
    split_k_stop_val = Mk / max_chunk_size
    split_k_upper_bound = 64

while split_k > split_k_stop_val:
    split_k = split_k // 2
while split_k > split_k_stop_val:
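The halving loop in the fragment above can be sketched in isolation. This is an illustrative standalone version, not the xformers implementation; the initial split_k and the stop value are chosen arbitrarily for the example:

```python
def halve_split_k(split_k: int, split_k_stop_val: float) -> int:
    # Repeatedly halve split_k until it no longer exceeds the stop value.
    # Mirrors the loop shown in the diff fragment above.
    while split_k > split_k_stop_val:
        split_k = split_k // 2
    return split_k

# halve_split_k(64, 10.0) halves 64 -> 32 -> 16 -> 8 and stops at 8.
```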
Confirming: is this an intended change?
Sorry, it seems I pushed an unexpected commit to this PR. I have converted it to a draft and will let you know when the code is cleaned up.
Just double checked: we intend the changes to impact only the HIP side, which is why the changes above are made here.
Thanks. I will check and get back to you.
Changes in this PR only apply to the calculation of split_k in the HIP backend, so they will not impact the CUDA side.
Hi @jianyuh, @shagunsodhani, when you get a chance, could you please take a look at this PR? Thanks.
xformers/ops/fmha/triton_splitk.py
Outdated
if torch.version.hip:
    max_chunk_size = 64
    split_k_stop_val = min(Mk / max_chunk_size, 1024 / (B * G * H))
    split_k_stop_val = 1024 / (B * G * H)
    while split_k > 0 and Mk / (split_k - 1) < max_chunk_size:
Should it be while split_k > 1 to prevent division by 0?
Thanks, yes, good catch: it should be split_k > 1.
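With the corrected guard, the loop can be sketched as follows. The diff fragment does not show the loop body, so decrementing split_k inside the loop is an assumption made for illustration:

```python
def shrink_split_k(split_k: int, Mk: int, max_chunk_size: int) -> int:
    # Guard with split_k > 1 so (split_k - 1) never reaches 0 in the
    # division below, avoiding a ZeroDivisionError.
    # The decrement in the body is an assumption; the diff omits it.
    while split_k > 1 and Mk / (split_k - 1) < max_chunk_size:
        split_k -= 1
    return split_k
```

With split_k == 1 the loop exits immediately instead of dividing by zero, which is exactly the failure mode the review comment flagged.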
LGTM. This is a small change on nvidia (adding
Thanks, I reverted that change related to nvidia.
Hi all, I am wondering whether anyone has additional comments for this PR? If not, could we get this PR merged? Thanks.
LGTM, change only affects AMD and verified internally in D57316421
Hi @jianyuh, could you please help get this PR merged if no additional comments? Thanks
What does this PR do?
Fixes a performance regression for FA decode. The changes are in the function get_split_k(), which computes the number of split-K partitions for FA decode in Triton. This PR also adds a test to verify the correctness of different implementations of FA decoders; you can run the tests as:
pytest benchmark_attn_decoding.py -v
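Piecing together the diff fragments quoted in the conversation, the HIP-vs-CUDA branching of the heuristic might look roughly like the sketch below. This is an illustrative reconstruction, not the authoritative xformers implementation: the initial split_k value is an assumption (here taken as the upper bound), and only the names visible in the fragments (bh, max_chunk_size, split_k_stop_val, split_k_upper_bound) are from the PR:

```python
def get_split_k_sketch(B: int, G: int, H: int, Mk: int, is_hip: bool) -> int:
    # Illustrative reconstruction from the diff fragments in this PR;
    # not the actual xformers get_split_k.
    bh = B * G * H
    if is_hip:
        # HIP branch: cap the split count both by chunk size and by total
        # batch*group*head parallelism, per the fragments above.
        max_chunk_size = 64
        split_k_stop_val = min(Mk / max_chunk_size, 1024 / bh)
        split_k_upper_bound = 512
    else:
        # CUDA branch left unchanged by the PR, per the discussion.
        max_chunk_size = 64 if Mk <= 512 and bh <= 64 else 128
        split_k_stop_val = Mk / max_chunk_size
        split_k_upper_bound = 64
    split_k = split_k_upper_bound  # starting point is an assumption
    while split_k > split_k_stop_val:
        split_k = split_k // 2
    return max(split_k, 1)
```

The key point of the PR, per the author's comments, is that only the is_hip branch changes, so CUDA behavior is preserved.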
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.