[QST] Question about the picture in the documentation "Efficient GEMM in CUDA"
#2034

I noticed the picture in this manual: https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md
The partition from global memory into shared memory blocks is easy to understand.
My question comes from the 2nd part, the Thread Block Tile.
In the picture, it seems to use an outer product, which takes a column of A and a row of B to generate a matrix C:
A.shape (M, 1), B.shape (1, N) -> C.shape (M, N)
Is that actually the case?
If so, why is it different from the 1st block partition?
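To make the shapes concrete, here is a minimal single-threaded reference sketch (hypothetical code, not from CUTLASS): for each k, column k of A (M x 1) times row k of B (1 x N) yields a rank-1 M x N update, and the full C is the sum of K such updates.

```cuda
// Hypothetical reference (plain host code): GEMM written as a sum of K
// rank-1 outer products. A is M x K row-major, B is K x N row-major,
// C is M x N row-major and assumed zero-initialized.
void gemm_sum_of_outer_products(const float *A, const float *B, float *C,
                                int M, int N, int K) {
    for (int k = 0; k < K; ++k)              // one rank-1 update per k slice
        for (int m = 0; m < M; ++m)          // A[m][k]: column k of A (M x 1)
            for (int n = 0; n < N; ++n)      // B[k][n]: row k of B (1 x N)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
}
```

Under this reading, the (M, 1) x (1, N) product in the figure would be a single step of the k loop rather than the whole tile computation.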
You can design how a tile is computed in almost any way you like. In the diagram, the K dimension of the two matrices appears to be 1, but it does not always have to be 1. See some of my posts:
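To illustrate the point, here is a minimal shared-memory tiling sketch (hypothetical tile sizes, bounds checks omitted; this is not the actual CUTLASS kernel) in which the staged tile depth BK is 8 rather than 1:

```cuda
// Hypothetical thread-block-tile sketch: each block stages a BM x BK tile
// of A and a BK x BN tile of B in shared memory, then sweeps k inside the
// staged tile. Nothing forces the tile depth to be 1.
// Assumes a 32x32 thread block and M, N, K divisible by the tile sizes.
#define BM 32
#define BN 32
#define BK 8

__global__ void block_tile_gemm(const float *A, const float *B, float *C,
                                int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    int row = blockIdx.y * BM + threadIdx.y;  // one C element per thread
    int col = blockIdx.x * BN + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += BK) {      // march tiles along K
        if (threadIdx.x < BK)                 // cooperative loads
            As[threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
        if (threadIdx.y < BK)
            Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < BK; ++k)          // BK k-slices per staged tile
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```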
Great! Thank you for the great post; I'll study it in depth!
So the Thread Block Tile is almost the same as the blocked GEMM; the picture is just misleading. I see your code in the 1st post uses multiple levels of tiling, but it seems there is no outer product there, while I do see an outer product in this documentation. Do you know what it means?
Thread-level GEMM can be implemented for CUDA Cores. If we want to utilize Tensor Cores, we should use warp-level GEMM (although for older architectures such as Volta, quadpair-level GEMM can also be used).
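For reference, a minimal sketch of warp-level GEMM on Tensor Cores using CUDA's WMMA API (this is an assumption-laden illustration, not CUTLASS's own warp-level MMA abstractions): one warp computes a 16x16 tile of C, accumulating 16x16x16 products over K.

```cuda
// Hypothetical warp-level sketch: one warp computes a 16x16 C tile on
// Tensor Cores via the WMMA API.
// Assumes sm_70+, K a multiple of 16, and a launch with one 32-thread warp.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void warp_gemm_16x16(const half *A,  // 16 x K, row-major
                                const half *B,  // K x 16, column-major
                                float *C,       // 16 x 16, row-major
                                int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {                    // march along K
        wmma::load_matrix_sync(a_frag, A + k, K);        // next 16 cols of A
        wmma::load_matrix_sync(b_frag, B + k, K);        // next 16 rows of B
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A_k * B_k
    }
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```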
Again, although I said the K dimension of the two matrices appears to be 1, it does not always have to be 1. The diagram never explicitly states that K = 1, so it is not completely wrong.
Sorry, I didn't notice that. I'll read the chapter more carefully.