Commit
* Build xformers using ck-tiled as the default
* ensure ck_decoder does not dispatch
* Add disable_on_rocm to some test scripts
* Update test_mem_eff_attention.py
* apply isort
* apply black
* fix flake8 suggestions
* add license headers and reapply black
* Tiny update to rocm_ci.yml
* Add conditional compilation for CUDA-dependent code on ROCm
* Update benchmark scripts
* Rename one script file
* Revert "Add conditional compilation for CUDA-dependent code on ROCm" This reverts commit 12fb41c.
* Update scripts
* Change and add READMEs for tests and benchmarks
* Remove the code supporting old CK
* Remove old composable_kernel from the submodule list
* Remove folder third_party/composable_kernel
* Rename the folder
* Remove an unused script file
* apply black
* pacify mypy
* fix clang-format
* reapply black
* fix lints
* make test_splitk_reference run on CPU
* add ck modules to docs
* try fixing the NVIDIA build by re-including the sparse24 cpp folder in the extension sources
* update cutlass to the upstream commit
* update flash-attention to the upstream commit
* simplify setup.py
* remove duplicate run_batched_infer_causalmask_attnbias_dispatched<f16, true, true, 128>
* add HIP version and PyTorch HIP arch list to xformers build info
* fix build
* patch around the unhappy path in get_hip_version
* skip test_grad_checkpointing for triton_splitk since it doesn't have a bwop
* re-enable test_mqa_forward since ck-tiled is the current implementation
* make the test_wrong_alignment skip more generic
* reapply black
* simplify test_decoder
* put the Python version check inside the triton_splitk op
* fix logic
* clean up Python 3.9 checks in tests
* clean up test_attentions
* clean up test_checkpoint, as running the test on CPU does not depend on the GPU platform
* fix lints
* try fixing the Windows build by conditionally importing triton in the triton op
* re-enable test_triton_layernorm as it passes
* re-enable test_triton_blocksparse as it passes
* clean up test_sparse_tensors
* clean up test_custom_ops
* reapply black
* clean up test_core_attention
* benchmark CK ops on ROCm only
* fix mypy
* fix lint: black
* fix lints: mypy
* split-k decoder: move all tunable parameters to the top of the cpp file
* apply clang-format
* Rename HDim/headdim to MaxK/maxk
* Move some header files to CK examples for later reuse
* Replace the qs_ks_vs pipeline with the qr_ks_vs pipeline when HeadDim is 256, for better performance
* rm test_ck_7
* dump kernel resource usage to compilation logs, similar to NV
* Adapt the C++ extension to the latest change of the ck_tile/dev fwd kernel (added dropout)
* Extend the C++ extension to use the ck_tile/dev fmha bwd kernel
* Update to add dropout to fmha backward
* Update attention.cpp to align the efficient_attention_backward_ck interface parameters
* Enable BwdOp in ck.py
* Support grad_out having different strides than out
* Force seqstart_q/seqstart_k to be in device memory in ck.py
* Remove duplicated code in ck_tiled_fmha_grouped_forward.h/infer.h
* Use the optimized async pipeline where 8x headdim length is assumed
* Fix in batched_infer
* Update to track the ck_tile/opt_padding_fa_train_xformers branch
* Update rocm_ci.yml: configure the self-hosted runner
* Update to use the newer FmhaFwdEpilogue
* Update rocm_ci.yml: add an option to manually trigger the workflow
* Update rocm_ci.yml: remove the condition that skips CI unless the GitHub event contains the string 'rocm'
* copy the rocm_ci workflow from the main branch
* Update rocm_ci.yml: bump the upload-artifact version
* Update to use the newer FmhaFwdEpilogue for grouped infer/forward
* Temporarily disable the use of the QRKSVSAsync() pipeline
* Update rocm_ci.yml: add a daily run
* Implement the ck_rand_uniform interface for generating a random-number tensor
* Add dropout to the infer path (needed by the xformers test_dropout)
* Update to support the test_dropout and test_dropout_backward tests
* Update the padding method in batched_backward.h
* Update the OGradDotO kernel padding method
* Change the backward padding-check condition
* Add batch_stride_lse/d parameters to adapt grouped-mode forward/backward to the [num_batches, H, MaxSeqlenQ] layout
* Fill grad_bias in advance
* Add support for kHasBiasGrad as an instance template parameter
* Remove the use of hdim_stride_do in fmha backward
* Force kPadSeqLenQ/kPadSeqLenK to be true in batched backward to save compile time
* Fix the missing passing of {philox_seed, philox_offset} in the inference path
* Replace GenericAttentionMask with SimplifiedGenericAttentionMask
* Shorten the instance file names
* Rename the template parameters
* Simplify the names of the dispatch class and interfaces
* Reuse the kernel files under the ck_tile examples/91_tile_program/fmha folder
* Update test_mem_eff_attention.py for test_dropout/test_dropout_backward/test_backward on ROCm
* Tiny change to the philox_cuda_state input setting
* Allocate logsumexp to ensure aligned access by each thread-group
* Add checking of the query/key headdim size in attention_backward_generic
* Use ck_tile/opt_padding_fa_train_pr2 and synchronize the backward code with its changes
* Enable the async pipeline in the batched inference path for performance
* Reorganize the cpp instances for calling the fmha infer kernel
* Reorganize the cpp instances for calling the fmha forward kernel
* Reorganize the cpp instances for calling the fmha backward kernel
* Point composable_kernel_tiled to the ck_tile/opt_padding_fa_train branch
* Update to synchronize with the latest commits in ck_tile/opt_padding_fa_train
* update the submodule to the public repo
* Update the criteria for padding seqlen_k in batched infer/forward
* Keep tracking the latest ck-tile commits
* Tiny fix to the decoder includes
* Point ck-tiled to the ck_tile/opt_padding branch
* Enable some attn_bias types in ck.py which were previously disabled by old CK
* Add script generate_instances.py, which helps generate instances
* Simplify logic for seqstart_q/k: ROCm@566d26f put seqstart_k/q on device, so the logic is simplified here. Upstream xformers does not have this optimization and copies seqstart_q/k on every iteration. We'd like this change to get in and then be merged upstream. (See the seqstart sketch after this list.)
* Add the Async pipeline to the grouped-mode inference path
* Use explicit true for the kPadSeqLenQ/kPadHeadDimQ/kPadHeadDimV templates for the Async pipeline
* Synchronize to the composable_kernel_tiled update for better performance
* Update rocm_ci.yml: clean up dangling images after the CI run
* Avoid an unused-const-variable warning: our compiler errors on unused-const-variable warnings, so just fix this
* Tiny change to the BlockTile/Shape setting overrides
* try to align the fmha C++ extension with ck_tile on the CK develop branch
* Synchronize composable_kernel_tiled to the latest CK develop
* Use FmhaFwdTilePartitioner_HBS only for seqlen_k-padded cases
* Tiny fix/change to make test_forward/test_backward/test_dropout/test_dropout_backward_ck pass
* Fix a compilation issue regarding the Invoker definitions in the forward_decoder/forward_decoder_split operators
* Keep using -Woverloaded-virtual
* Fix clang-format for header and cpp files
* Fix formatting in Python scripts
* Add noqa: C801 for generate_instances.py
* Align dispatch_bw with the main branch
* Align ops/fmha/common.py with the main branch
* Synchronize third_party/composable_kernel_tiled to the latest ck_tile commits for better performance
* Relax the atol for test_forward and test_dropout due to the use of packed fp16_2_fp32 conversion in ck_tile (see the tolerance sketch after this list)
* Generate an HTML report for tests run with rocm_ci.yml
* archive test results when tests have failed
* Always clean up dangling Docker images in rocm_ci
* Bump Python to 3.11 in rocm_ci.yml
* Disable flash-attention tests in rocm_ci.yml, since the op is broken; TBD: either make the op work or disable it on ROCm
* Try to fix rocm_ci.yml: init must be called before activation
* try to fix the rocm_ci.yml flow by overriding PATH
* Fix the setup.py path in rocm_ci.yml
* cd to the xformers dir before running the install in rocm_ci.yml
* Use pip to install xformers in rocm_ci.yml
* Possibly fix Python version resolution in rocm_ci.yml
* Set the correct path for pytest in rocm_ci.yml
* remove test_reference_splitk, as it was moved to a different file during the first upstreaming; remove test_mqa_forward from develop, as the test fails in develop and doesn't run upstream; remove the reference attention split-k from the test file, as it exists in test_splitk_reference; sync test_mem_eff_attention with upstream
* make sure CK operators have a name so they are visible in the dispatcher
* fix SM version checks to happen only on CUDA, not ROCm (see the device-check sketch after this list)
* (2/n) fix SM version checks to happen only on CUDA, not ROCm
* Remove _check_large_shapes checking in fmha/ck.py (#1067)
* make the xformers install editable to fix cpp extension detection
* Update to use the improved fmha-bwd (compilation passes)
* Update to get 80% of the test_backward and test_dropout_backward_ck cases passing
* Replace the use of ConvertGradQ with torch tensor type conversion
* Change the tile settings for MaxK=32
* Fix a padding-setting bug in grouped_backward
* Change -DCK_FMHA_FWD_FAST_EXP2=1 to -DCK_TILE_FMHA_FWD_FAST_EXP2=1
* Point the composable_kernel_tiled submodule to the ck_tile/fa_bwd_opt branch
* Disable flshattF and flshattB on ROCm
* Add the -mllvm -enable-post-misched=0 compile options for ROCm in setup.py (see the build-flags sketch after this list)
* Disable flshattF and flshattB on ROCm
* Update to support separate grad_q_f32_strides due to the API change in the fmha_bwd_kernel
* Use the old method for setting BlockDropout due to the revert in fmha_fwd_kernel
* Tiny fix in grouped_backward
* Use packed tensor allocation for grad_q_f32
* Update the ConvertGradQ kernel invocation
* Tiny update
* Fix the parameter location in grouped_backward
* Adjust the headdim128 tile shapes for better performance
* Update the backward kernel invocation due to the addition of the nhead_stride_dk/nhead_stride_dv parameters
* Synchronize with CK to use separate pipelines for the kPadHeadDim true and false situations
* Use the convertDQ kernel
* Update to use the unpadded lse layout
* Add explicit headdim256 instances for fmha backward
* Add the missing headdim256 instance references
* Change generate.py and regenerate the instance files using it
* Change generate.py to generate instance references and use the generated reference headers
* Relax the RTOL of ckFwOp from 4e-4 to 3e-3 due to one large-result case (see the tolerance sketch after this list)
* Use .h rather than .hpp as the suffix for generated header files
* Fix in .gitignore
* Update the bwd setting to use only the IGLP pipeline
* Synchronize to the latest ck_tile fix and align the headdim64 tile-shape setting
* Reformat the generated instance cpp files
* Fix the backward Trait
* Set occupancy to -1 to avoid the compile warning
* Revert "Set occupancy to -1 to avoid the compile warning" This reverts commit fa6d8b3.
* Add an environment variable and compiler definition to control the generation of headdim256 instances
* Add the --ignore-hd256 argument to generate_instance.py and some updates to the script (see the argparse sketch after this list)
* Add the environment variable ENABLE_HIP_FMHA_RTN_BF16_CONVERT to enable RTN bf16 conversion (see the build-flags sketch after this list)
* Remove commented lines in test_mem_eff_attention.py
* Synchronize to the latest ck_tile commit
* apply black
* apply flake8
* fix mypy
* revert disabling the flash operator on ROCm
* Synchronize to the latest ck_tile commit again
* Repoint the composable_kernel submodule to the develop branch
* Avoid the Async pipeline when kHasBias is true
* clang-format for two files
* Change the allocation of the grouped-mode lse from [H, M] to [1, H, M] to match the xformers scripts
* Change generate_instances.py so the script can be called from any location
* Add a manual for generate_instances.py (.md)
* Modify GENERATE_INSTANCES.md
* Fix in GENERATE_INSTANCES.md
* Update GENERATE_INSTANCES.md
* clean up commented code
* Revert "Change the allocation of the grouped-mode lse from [H, M] to [1, H, M] to match the xformers scripts" This reverts commit 7a91589.
* Synchronize to the latest CK develop to use the latest RTN bf16 conversion
* Add C++ extension compile options for better performance on ROCm 6.2
* Use the same rocm_ci.yml as upstream
* Use the same ck.py as upstream
* Reformat setup.py
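The seqstart items above boil down to keeping the cumulative sequence-start tensors resident on the GPU instead of copying them host-to-device on every attention call. A minimal sketch of that pattern, with an illustrative helper name and tensor values (not the actual ck.py code):

```python
import torch

def seqstart_on_device(seqstart: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Variable-length attention kernels index rows through these cumulative
    # offsets; keeping them on the GPU avoids a host-to-device copy per call.
    if seqstart.device != device:
        seqstart = seqstart.to(device=device, dtype=torch.int32, non_blocking=True)
    return seqstart

# Built once on CPU (offsets of three sequences of length 128), moved once,
# then reused by every subsequent kernel launch.
seqstart_q = torch.tensor([0, 128, 256, 384], dtype=torch.int32)
seqstart_q = seqstart_on_device(seqstart_q, torch.device("cuda"))
```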
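The tolerance relaxations (atol for test_forward/test_dropout, RTOL of ckFwOp from 4e-4 to 3e-3) exist because ck_tile's packed fp16-to-fp32 conversion shifts results slightly on one large-magnitude case. A hedged sketch of what such a check looks like; the atol value here is illustrative, and the real test file structures this differently:

```python
import torch

def assert_fwd_close(out: torch.Tensor, ref: torch.Tensor) -> None:
    # rtol relaxed from 4e-4 to 3e-3: one large-magnitude case exceeded the
    # old bound after ck_tile switched to packed fp16->fp32 conversion.
    torch.testing.assert_close(out, ref, rtol=3e-3, atol=2e-3)
```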
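For the SM version-check fixes, the point is that compute-capability gating is a CUDA concept and must be skipped on ROCm rather than rejecting the device. A hedged sketch of the guard (not the exact xformers helper):

```python
import torch

def sm80_or_newer() -> bool:
    # torch.version.hip is non-None on ROCm builds, where SM (compute
    # capability) numbers do not apply, so bail out before querying them.
    if not torch.cuda.is_available() or torch.version.hip is not None:
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```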
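Several items wire environment variables and compiler definitions into the ROCm build. A sketch of how that can look in a setup.py: the ENABLE_HIP_FMHA_RTN_BF16_CONVERT variable, the -DCK_TILE_FMHA_FWD_FAST_EXP2=1 define, and the -mllvm -enable-post-misched=0 options come from the log above, while the second environment variable and both macro names marked as placeholders are hypothetical:

```python
import os

def hip_fmha_compile_flags() -> list:
    flags = ["-DCK_TILE_FMHA_FWD_FAST_EXP2=1"]
    # Opt in to round-to-nearest bf16 conversion (placeholder macro name).
    if os.environ.get("ENABLE_HIP_FMHA_RTN_BF16_CONVERT", "0") == "1":
        flags.append("-DHIP_FMHA_USE_RTN_BF16_CONVERT=1")
    # Skip the compile-heavy headdim-256 instances (env var and macro hypothetical).
    if os.environ.get("DISABLE_HD256_HIP_FMHA", "0") == "1":
        flags.append("-DHIP_FMHA_IGNORE_HD256=1")
    # Scheduler tweak for ROCm mentioned in the log above.
    flags += ["-mllvm", "-enable-post-misched=0"]
    return flags
```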
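The --ignore-hd256 switch lets instance generation drop the headdim-256 kernels, which are costly to compile. An illustrative argparse sketch; the real generator may structure its argument handling differently, and the headdim list below is an assumption:

```python
import argparse

parser = argparse.ArgumentParser(description="Generate fmha instance files")
parser.add_argument(
    "--ignore-hd256",
    action="store_true",
    help="do not generate headdim-256 instances, to cut compile time",
)
args = parser.parse_args()

# Hypothetical use: the headdim list the generator iterates over.
headdims = [32, 64, 128] if args.ignore_hd256 else [32, 64, 128, 256]
```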
--------
Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: carlushuang <[email protected]>
Co-authored-by: Xiaodong Wang <[email protected]>