
[QST] Question about the picture in the documentation "Efficient GEMM in CUDA" #2034

Open · sleepwalker2017 opened this issue Jan 9, 2025 · 6 comments

@sleepwalker2017 commented Jan 9, 2025

I noticed the picture in this manual: https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md

The partition from global memory to shared memory blocks is easy to understand.

My question is about the 2nd part: the Thread Block Tile.

In the picture, it seems to use an outer product, which multiplies a column of A by a row of B to generate a matrix C:

A.shape (M, 1), B.shape (1, N) -> C.shape (M, N)

Is that the case?

If so, why is it different from the 1st block partition?

[Image: hierarchical tiling diagram from efficient_gemm.md]

@leimao (Contributor) commented Jan 9, 2025

You can design how a tile is computed in almost any way you like. Although in the diagram the K dimension for the two matrices seems to be 1, it does not always have to be 1.
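For illustration, here is a minimal host-side sketch of the loop structure for one thread block tile (generic code, not from the CUTLASS docs; the tile sizes BM/BN/BK and all names are hypothetical). The K extent of the A and B slices is a tile size BK >= 1, not necessarily 1:

```cpp
#include <cstddef>

// Hypothetical block tile sizes; assumes M, N, K are divisible by them.
constexpr std::size_t BM = 64, BN = 64, BK = 8;

// Computes one (BM x BN) block tile of C for row-major A (M x K) and
// B (K x N). The block accumulates over K in chunks of BK, so each step
// consumes a (BM x BK) slice of A and a (BK x BN) slice of B.
// C is assumed to be zero-initialized.
void block_tile_gemm(const float* A, const float* B, float* C,
                     std::size_t N, std::size_t K,
                     std::size_t block_m, std::size_t block_n) {
    for (std::size_t k0 = 0; k0 < K; k0 += BK) {          // loop over K tiles
        for (std::size_t i = 0; i < BM; ++i) {
            for (std::size_t j = 0; j < BN; ++j) {
                float acc = 0.0f;
                for (std::size_t k = k0; k < k0 + BK; ++k) {
                    acc += A[(block_m * BM + i) * K + k] *
                           B[k * N + block_n * BN + j];
                }
                C[(block_m * BM + i) * N + block_n * BN + j] += acc;
            }
        }
    }
}
```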

See some of my posts:

• CuTe Tiled MMA

@sleepwalker2017 (Author)

Great! Thank you for the post, I'll study it in depth!

@sleepwalker2017 (Author)

So the Thread Block Tile is almost the same as the blocked GEMM; the picture is just a bit misleading.

Is there any outer product in the computation of a tile?

I see the code in your 1st post; it uses multiple levels of tiling, but it seems no outer product is used.

I see an outer product mentioned in this documentation. Do you know what it means?
https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#thread-level-gemm

@leimao (Contributor) commented Jan 9, 2025

> I see the code in your 1st post; it uses multiple levels of tiling, but it seems no outer product is used.

`for (size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i)`: this is where the outer product is performed, IIRC. You can Ctrl + F to search for "outer product" in the article.
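To make that concrete, here is a generic sketch of the pattern inside that loop (hypothetical names and sizes, not the article's exact code): at each k_i, a thread reads a column fragment of the A tile and a row fragment of the B tile, then performs a rank-1 (outer-product) update of its C fragment:

```cpp
#include <cstddef>

// Hypothetical sizes, for illustration only.
constexpr std::size_t BLOCK_TILE_SIZE_K = 8;
constexpr std::size_t THREAD_TILE_Y = 4;  // rows of C owned by one thread
constexpr std::size_t THREAD_TILE_X = 4;  // cols of C owned by one thread

// A_tile is stored transposed (K-major) so a thread's column fragment of A
// is contiguous; B_tile is K-major as well.
void thread_tile_outer_product(
    const float (&A_tile)[BLOCK_TILE_SIZE_K][THREAD_TILE_Y],
    const float (&B_tile)[BLOCK_TILE_SIZE_K][THREAD_TILE_X],
    float (&C_frag)[THREAD_TILE_Y][THREAD_TILE_X]) {
    for (std::size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i) {
        // Load a (THREAD_TILE_Y x 1) column of A and a (1 x THREAD_TILE_X)
        // row of B into registers.
        float a_frag[THREAD_TILE_Y];
        float b_frag[THREAD_TILE_X];
        for (std::size_t y = 0; y < THREAD_TILE_Y; ++y) a_frag[y] = A_tile[k_i][y];
        for (std::size_t x = 0; x < THREAD_TILE_X; ++x) b_frag[x] = B_tile[k_i][x];
        // Rank-1 update: C_frag += a_frag * b_frag^T (the outer product).
        for (std::size_t y = 0; y < THREAD_TILE_Y; ++y)
            for (std::size_t x = 0; x < THREAD_TILE_X; ++x)
                C_frag[y][x] += a_frag[y] * b_frag[x];
    }
}
```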

> I see an outer product mentioned in this documentation. Do you know what it means?

Thread-level GEMM can be implemented for CUDA cores. If we want to utilize Tensor Cores, we should use warp-level GEMM (although for older architectures such as Volta, quadpair-level GEMM can also be used).
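For a sense of what warp-level GEMM looks like, here is a minimal generic sketch using the CUDA WMMA API (requires sm_70 or newer; illustrative only, not code from the CUTLASS docs): a single warp computes one 16x16x16 half-precision MMA on Tensor Cores.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B for 16x16 row-major half-precision tiles,
// accumulating in float. Launch with a single warp, e.g. <<<1, 32>>>.
__global__ void wmma_gemm_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start from C = 0
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```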

@leimao (Contributor) commented Jan 9, 2025

Again, although I said the K dimension for the two matrices seems to be 1, it does not always have to be 1. The diagram never explicitly states that K = 1, so it's not completely wrong.

@sleepwalker2017 (Author) commented Jan 10, 2025

> `for (size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i)`: this is where the outer product is performed, IIRC. You can Ctrl + F to search for "outer product" in the article.
>
> Thread-level GEMM can be implemented for CUDA cores. If we want to utilize Tensor Cores, we should use warp-level GEMM (although for older architectures such as Volta, quadpair-level GEMM can also be used).

Sorry, I didn't notice it. I'll read the chapter more carefully.
