Flash attention #18270
Replies: 3 comments
-
I'm also following this issue. Flash attention doesn't seem to be handled well, which leads to out-of-memory problems during ONNX Runtime inference.
-
What is the current status of FlashAttention in ONNX Runtime?
-
Currently compiling ONNX Runtime from source, and I do see a bunch of files related to flash attention being compiled.
It looks like this implementation is called from onnxruntime's Attention op under certain conditions, which you can see here. There are some heuristics onnxruntime applies before it selects flash attention; if the sequence length is low, for instance, it uses another method. You also need to make sure you pass all the conditions here, though I'm not sure what some of them mean.
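One rough way to poke at this is below. It's only a sketch: it assumes the `ORT_DISABLE_FLASH_ATTENTION` and `ORT_DISABLE_MEMORY_EFFICIENT_ATTENTION` environment variables that show up in the CUDA attention sources are still the toggles (verify the names against your checkout), and the model path and input names are placeholders for an optimized BERT-style model that contains the contrib Attention op.

```python
import os
import numpy as np
import onnxruntime as ort

# These env var names appear in onnxruntime's CUDA attention code
# (attention_common.h in my checkout); treat them as assumptions and
# verify against your source tree.
os.environ["ORT_DISABLE_FLASH_ATTENTION"] = "0"             # allow flash attention
os.environ["ORT_DISABLE_MEMORY_EFFICIENT_ATTENTION"] = "0"  # allow memory-efficient attention

# "model.onnx" and the input names are placeholders for an optimized
# BERT-style export that uses the contrib Attention/MultiHeadAttention op.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# A longer sequence makes it more likely the flash-attention path is chosen;
# short sequences tend to fall back to other kernels per the heuristics above.
batch, seq_len = 1, 1024
inputs = {
    "input_ids": np.random.randint(0, 30522, (batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
}
outputs = sess.run(None, inputs)
print([o.shape for o in outputs])
```

Running the same inputs once with `ORT_DISABLE_FLASH_ATTENTION=1` and comparing memory use and latency is a crude but workable way to tell whether the flash path was actually being taken.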
-
Does ONNX Runtime use flash attention?
I noticed that in the contrib operators there are CPU and CUDA implementations of memory-efficient attention. Are they used generally in the CPU and CUDA providers, or are they specific to BERT?
For example, does PyTorch's scaled_dot_product_attention() get ONNX-exported to an efficient kernel, or does it get unfolded into a bunch of MatMul operations? Thank you.
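One way to check what the exporter does with scaled_dot_product_attention is to export a tiny module and list the operator types in the resulting graph. This is only a sketch, assuming a recent PyTorch whose exporter has a symbolic for SDPA; the file name, opset, and shapes are arbitrary.

```python
import onnx
import torch
import torch.nn.functional as F

class SDPA(torch.nn.Module):
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)

# Arbitrary shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

torch.onnx.export(
    SDPA(), (q, k, v), "sdpa.onnx",
    opset_version=17,
    input_names=["q", "k", "v"], output_names=["out"],
)

# List the distinct op types the exporter produced.
model = onnx.load("sdpa.onnx")
print(sorted({node.op_type for node in model.graph.node}))
# With the TorchScript exporter this typically shows a decomposed pattern
# (MatMul, Softmax, scaling ops) rather than a single fused attention node;
# in my understanding a fused com.microsoft Attention/MultiHeadAttention node
# usually only appears after running onnxruntime's transformer optimizer over
# the exported graph, but that's worth verifying.
```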