Flash attention #18270
Replies: 3 comments
-
I'm also following this issue. Flash attention doesn't seem to be handled well, which leads to out-of-memory problems during ONNX Runtime inference.
-
What is the current status of FlashAttention in ONNX Runtime?
-
Currently compiling ONNX Runtime from source, and I do see a bunch of files related to flash attention being compiled.
It looks like this implementation is called from onnxruntime's Attention op under certain conditions, which you can see here. There are some heuristics onnxruntime applies before it selects flash attention; if the sequence length is low, for instance, it uses another method. You also need to make sure you pass all the conditions here, though I'm not sure what some of them mean.
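One rough way to poke at this is below. It's only a sketch: it assumes the `ORT_DISABLE_FLASH_ATTENTION` and `ORT_DISABLE_MEMORY_EFFICIENT_ATTENTION` environment variables that show up in the CUDA attention sources are still the toggles (verify the names against your checkout), and the model path and input names are placeholders for an optimized BERT-style model that contains the contrib Attention op.

```python
import os
import numpy as np
import onnxruntime as ort

# These env var names appear in onnxruntime's CUDA attention code
# (attention_common.h in my checkout); treat them as assumptions and
# verify against your source tree.
os.environ["ORT_DISABLE_FLASH_ATTENTION"] = "0"             # allow flash attention
os.environ["ORT_DISABLE_MEMORY_EFFICIENT_ATTENTION"] = "0"  # allow memory-efficient attention

# "model.onnx" and the input names are placeholders for an optimized
# BERT-style export that uses the contrib Attention/MultiHeadAttention op.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# A longer sequence makes it more likely the flash-attention path is chosen;
# short sequences tend to fall back to other kernels per the heuristics above.
batch, seq_len = 1, 1024
inputs = {
    "input_ids": np.random.randint(0, 30522, (batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
}
outputs = sess.run(None, inputs)
print([o.shape for o in outputs])
```

Running the same inputs once with `ORT_DISABLE_FLASH_ATTENTION=1` and comparing memory use and latency is a crude but workable way to tell whether the flash path was actually being taken.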
-
Does ONNX Runtime use flash attention?
I noticed that in the contrib operators there are CPU and CUDA implementations of memory-efficient attention. Are they used generally in the CPU and CUDA providers, or are they specific to BERT?
For example, does PyTorch's scaled_dot_product_attention() get ONNX-exported to an efficient kernel, or does it get unfolded into a bunch of MatMul operations? Thank you.
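One way to check what the exporter does with scaled_dot_product_attention is to export a tiny module and list the operator types in the resulting graph. This is only a sketch, assuming a recent PyTorch whose exporter has a symbolic for SDPA; the file name, opset, and shapes are arbitrary.

```python
import onnx
import torch
import torch.nn.functional as F

class SDPA(torch.nn.Module):
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)

# Arbitrary shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

torch.onnx.export(
    SDPA(), (q, k, v), "sdpa.onnx",
    opset_version=17,
    input_names=["q", "k", "v"], output_names=["out"],
)

# List the distinct op types the exporter produced.
model = onnx.load("sdpa.onnx")
print(sorted({node.op_type for node in model.graph.node}))
# With the TorchScript exporter this typically shows a decomposed pattern
# (MatMul, Softmax, scaling ops) rather than a single fused attention node;
# in my understanding a fused com.microsoft Attention/MultiHeadAttention node
# usually only appears after running onnxruntime's transformer optimizer over
# the exported graph, but that's worth verifying.
```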