
[QST] Question about the picture in the documentation "Efficient GEMM in CUDA" #2034

Open · sleepwalker2017 opened this issue Jan 9, 2025 · 6 comments

@sleepwalker2017 commented Jan 9, 2025

I noticed the picture in this manual: https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md

The partition from global memory to shared memory blocks is easy to understand.

My question is about the 2nd part: the Thread Block Tile.

In the picture, it seems to use an outer product, which multiplies a column of A by a row of B to generate a matrix C:

A.shape (M, 1), B.shape (1, N) -> C.shape (M, N)

Is that the case?

If so, why is it different from the 1st block partition?

[Image: hierarchical tiling diagram from efficient_gemm.md]

@leimao (Contributor) commented Jan 9, 2025

You can design how a tile is computed in almost any way you like. Although in the diagram the K dimension for the two matrices seems to be 1, it does not always have to be 1.
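For illustration, here is a minimal host-side sketch of the loop structure for one thread block tile (generic code, not from the CUTLASS docs; the tile sizes BM/BN/BK and all names are hypothetical). The K extent of the A and B slices is a tile size BK >= 1, not necessarily 1:

```cpp
#include <cstddef>

// Hypothetical block tile sizes; assumes M, N, K are divisible by them.
constexpr std::size_t BM = 64, BN = 64, BK = 8;

// Computes one (BM x BN) block tile of C for row-major A (M x K) and
// B (K x N). The block accumulates over K in chunks of BK, so each step
// consumes a (BM x BK) slice of A and a (BK x BN) slice of B.
// C is assumed to be zero-initialized.
void block_tile_gemm(const float* A, const float* B, float* C,
                     std::size_t N, std::size_t K,
                     std::size_t block_m, std::size_t block_n) {
    for (std::size_t k0 = 0; k0 < K; k0 += BK) {          // loop over K tiles
        for (std::size_t i = 0; i < BM; ++i) {
            for (std::size_t j = 0; j < BN; ++j) {
                float acc = 0.0f;
                for (std::size_t k = k0; k < k0 + BK; ++k) {
                    acc += A[(block_m * BM + i) * K + k] *
                           B[k * N + block_n * BN + j];
                }
                C[(block_m * BM + i) * N + block_n * BN + j] += acc;
            }
        }
    }
}
```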

See some of my posts:

• CuTe Tiled MMA

@sleepwalker2017 (Author)

Great! Thank you for the post, I'll study it in depth!

@sleepwalker2017 (Author)

So the Thread Block Tile is almost the same as the blocked GEMM; the picture is just a bit misleading.

Is there any outer product in the computation of a tile?

I see the code in your 1st post; it uses multiple levels of tiling, but it seems no outer product is used.

I see an outer product mentioned in this documentation. Do you know what it means?
https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#thread-level-gemm

@leimao (Contributor) commented Jan 9, 2025

> I see the code in your 1st post; it uses multiple levels of tiling, but it seems no outer product is used.

`for (size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i)`: this is where the outer product is performed, IIRC. You can Ctrl + F to search for "outer product" in the article.
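To make that concrete, here is a generic sketch of the pattern inside that loop (hypothetical names and sizes, not the article's exact code): at each k_i, a thread reads a column fragment of the A tile and a row fragment of the B tile, then performs a rank-1 (outer-product) update of its C fragment:

```cpp
#include <cstddef>

// Hypothetical sizes, for illustration only.
constexpr std::size_t BLOCK_TILE_SIZE_K = 8;
constexpr std::size_t THREAD_TILE_Y = 4;  // rows of C owned by one thread
constexpr std::size_t THREAD_TILE_X = 4;  // cols of C owned by one thread

// A_tile is stored transposed (K-major) so a thread's column fragment of A
// is contiguous; B_tile is K-major as well.
void thread_tile_outer_product(
    const float (&A_tile)[BLOCK_TILE_SIZE_K][THREAD_TILE_Y],
    const float (&B_tile)[BLOCK_TILE_SIZE_K][THREAD_TILE_X],
    float (&C_frag)[THREAD_TILE_Y][THREAD_TILE_X]) {
    for (std::size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i) {
        // Load a (THREAD_TILE_Y x 1) column of A and a (1 x THREAD_TILE_X)
        // row of B into registers.
        float a_frag[THREAD_TILE_Y];
        float b_frag[THREAD_TILE_X];
        for (std::size_t y = 0; y < THREAD_TILE_Y; ++y) a_frag[y] = A_tile[k_i][y];
        for (std::size_t x = 0; x < THREAD_TILE_X; ++x) b_frag[x] = B_tile[k_i][x];
        // Rank-1 update: C_frag += a_frag * b_frag^T (the outer product).
        for (std::size_t y = 0; y < THREAD_TILE_Y; ++y)
            for (std::size_t x = 0; x < THREAD_TILE_X; ++x)
                C_frag[y][x] += a_frag[y] * b_frag[x];
    }
}
```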

> I see an outer product mentioned in this documentation. Do you know what it means?

Thread-level GEMM can be implemented for CUDA cores. If we want to utilize Tensor Cores, we should use warp-level GEMM (although for older architectures such as Volta, quadpair-level GEMM can also be used).
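For a sense of what warp-level GEMM looks like, here is a minimal generic sketch using the CUDA WMMA API (requires sm_70 or newer; illustrative only, not code from the CUTLASS docs): a single warp computes one 16x16x16 half-precision MMA on Tensor Cores.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B for 16x16 row-major half-precision tiles,
// accumulating in float. Launch with a single warp, e.g. <<<1, 32>>>.
__global__ void wmma_gemm_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start from C = 0
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```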

@leimao (Contributor) commented Jan 9, 2025

Again, although I said the K dimension for the two matrices seems to be 1, it does not always have to be 1. The diagram never explicitly states that K = 1, so it's not completely wrong.

@sleepwalker2017 (Author) commented Jan 10, 2025

> `for (size_t k_i{0U}; k_i < BLOCK_TILE_SIZE_K; ++k_i)`: this is where the outer product is performed, IIRC. You can Ctrl + F to search for "outer product" in the article.
>
> Thread-level GEMM can be implemented for CUDA cores. If we want to utilize Tensor Cores, we should use warp-level GEMM (although for older architectures such as Volta, quadpair-level GEMM can also be used).

Sorry, I didn't notice it. I'll read the chapter more carefully.
