I write kernels here:
- Fused Self Attention (forward)
  - check out the guide to understanding it: https://alexdremov.me/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton/ (a plain-PyTorch sketch of the tiled math follows this list)
- Streaming Attention (forward, backward, PT2/torch.compile compliant; see the custom-op sketch below the list)
  - detailed description in the docstring
  - TBD: guide to why it is cool (nudge me if interested)
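To make the fused (flash) attention idea concrete, here is a minimal plain-PyTorch sketch of the tiled online-softmax math the forward kernel fuses. This is not the Triton kernel in this repo and every name, shape, and tile size is just illustrative; it only spells out what the guide above covers and can double as a correctness oracle.

```python
import torch


def fused_attention_reference(q, k, v, tile=128):
    """Single-head attention computed tile-by-tile with an online softmax,
    i.e. the math a fused (flash) attention forward pass performs without
    ever materializing the full (seq_len x seq_len) score matrix.

    q, k, v: (seq_len, head_dim) tensors.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    # Running row-wise max and softmax normalizer, updated per K/V tile.
    row_max = torch.full((seq_len, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((seq_len, 1), dtype=q.dtype, device=q.device)

    for start in range(0, seq_len, tile):
        k_tile = k[start:start + tile]                 # (tile, head_dim)
        v_tile = v[start:start + tile]
        scores = (q @ k_tile.T) * scale                # (seq_len, tile)

        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        correction = torch.exp(row_max - new_max)      # rescales old accumulators
        p = torch.exp(scores - new_max)

        out = out * correction + p @ v_tile
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    q, k, v = (torch.randn(512, 64) for _ in range(3))
    expected = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(fused_attention_reference(q, k, v), expected, atol=1e-4)
```

The point is that only one (seq_len x tile) block of scores lives in memory at a time; the running max/normalizer corrects everything accumulated so far, which is exactly what the real kernel does per SRAM block.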
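"PT2 compliant" here presumably means the kernel composes with torch.compile. A common way to get there (shown below as an assumption about the plumbing, not this repo's actual API; op name, signature, and eager body are hypothetical) is to register the kernel as a torch.library custom op with a fake implementation, so the compiler can trace shapes and dtypes without executing the kernel. Requires PyTorch >= 2.4.

```python
import torch


@torch.library.custom_op("mykernels::streaming_attention", mutates_args=())
def streaming_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Placeholder eager body; a real implementation would launch the Triton kernel here.
    scale = q.shape[-1] ** -0.5
    return torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v


@streaming_attention.register_fake
def _(q, k, v):
    # Shape/dtype-only implementation used while tracing under torch.compile.
    return torch.empty_like(q)


if __name__ == "__main__":
    q, k, v = (torch.randn(2, 8, 512, 64) for _ in range(3))
    # torch.compile now treats the op as a single traceable node instead of
    # graph-breaking on it.
    compiled = torch.compile(lambda q, k, v: streaming_attention(q, k, v))
    print(compiled(q, k, v).shape)
```

A custom backward would be attached to the same op (e.g. via its register_autograd hook), which is presumably part of what makes the backward pass play nicely with PT2 as well.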