
What's the difference between cuda-l2-cache and gpu-cache benchmarks? #3

beginlner opened this issue May 19, 2023 · 4 comments

@beginlner

No description provided.

@te42kyfo (Collaborator)

I wrote cuda-L2-cache specifically to benchmark L2 cache bandwidth only. It simulates a scenario where data is read repeatedly by thread blocks on SMs all over the chip. The variable blockRun is set to the total number of simultaneously running thread blocks. Each thread reads N pieces of data in a grid-stride loop, and each piece of data is read by 10000 different thread blocks (see line 57, int blockCount = blockRun * 10000;). By adjusting N, the total data volume (N * blockSize * blockRun * 8 bytes) can be varied, which determines whether the data fits in the cache.
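A minimal sketch of that access pattern (my own illustration, not the actual cuda-L2-cache source; everything other than blockRun, N, and blockSize is a made-up name):

```cuda
#include <cstddef>

// Hedged sketch of the pattern described above. blockRun blocks fit on the
// GPU at once (one "wave"); blockCount = blockRun * 10000 blocks are launched
// in total, so every data slice is read by 10000 different blocks on SMs all
// over the chip.
__global__ void l2ReadSketch(const double *data, double *sink,
                             int blockRun, size_t N) {
  // Which of the blockRun data slices this block reads. The real benchmark
  // varies the block-to-slice mapping between waves; the wave offset here is
  // only a crude stand-in for that permutation.
  int wave  = blockIdx.x / blockRun;
  int slice = (blockIdx.x + wave) % blockRun;

  const size_t sliceLen = N * (size_t)blockDim.x;   // elements per slice
  const double *base = data + (size_t)slice * sliceLen;

  double sum = 0.0;
  for (size_t i = threadIdx.x; i < sliceLen; i += blockDim.x)
    sum += base[i];                                 // N reads per thread

  if (sum == 123.456) *sink = sum;  // defeat dead-code elimination
}

// Launch sketch: l2ReadSketch<<<blockRun * 10000, blockSize>>>(data, sink,
// blockRun, N); the buffer holds N * blockSize * blockRun doubles, matching
// the volume formula above.
```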

Think of it as a grid-stride loop over some data volume by as many threads as can run simultaneously (a 'wave'). Afterwards, 10000 more waves do the same thing, but each time the distribution of thread blocks over the data is different. This benchmark was written for this paper (see Figure 2). The peculiar way data is repeatedly read by different thread blocks is due to the A100's segmented L2 cache: if the same thread block repeatedly read the same L2 cache data, the benchmark would show a higher apparent L2 capacity, because no data would have to be duplicated. With this scheme, data has to be duplicated, because the reads come from SMs attached to different L2 cache segments.

The gpu-cache benchmark is a general cache benchmark for both the L1 and the L2 cache. Because each thread block reads the same data as all the others, the data never falls out of the L2 cache: even if the combined volume read by all blocks exceeds the L2 capacity, there is reuse in the L2 cache by different thread blocks.
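For contrast, a sketch of the gpu-cache style pattern under the same caveats (illustrative only, not the actual source): every block walks one shared buffer, so a single copy of the data serves all SMs from L2.

```cuda
#include <cstddef>

// Every thread block reads the very same buffer of `len` doubles, so there
// is only one copy of the data on the chip; once the working set exceeds L1,
// it is still served out of L2 as long as the buffer fits there.
__global__ void sharedDataSketch(const double *data, double *sink,
                                 size_t len) {
  double sum = 0.0;
  for (size_t i = threadIdx.x; i < len; i += blockDim.x)
    sum += data[i];                 // identical addresses in every block
  if (sum == 123.456) *sink = sum;  // defeat dead-code elimination
}
```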

@guohaoqiang


When I run gpu-l2-cache on an H100 PCIe, the bandwidth column looks odd (the L2 bandwidth should be around 7500 GB/s). Do I need to change the code?

[Screenshot: gpu-l2-cache benchmark output on H100 PCIe]

@te42kyfo (Collaborator)

Your results are absolutely in line with what I have measured myself. Regarding the very high numbers at the beginning:
For the first few dataset sizes, there is still some coverage by the 256 kB L1 cache. For example, the 2048 kB data point consists of 8 chunks of 256 kB, so there is a 1-in-8 chance that a thread block runs on an SM where the previous, just-exited thread block worked on the same chunk of data, which then still resides in the L1 cache. The curve eventually settles at around 6700 GB/s, which is the pure L2 bandwidth.
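A quick back-of-envelope for that effect (plain host code; the 256 kB chunk size is from the explanation above, the uniform-placement model is my assumption, not measured data):

```cuda
#include <cstdio>

int main() {
  // Chance that the previous block on an SM worked on the same 256 kB chunk,
  // assuming uniform random block placement: chunk / dataset. The effect
  // shrinks as the dataset grows, which is why only the first few points
  // are inflated.
  const double chunk_kB = 256.0;
  const double datasets_kB[] = {2048.0, 4096.0, 8192.0, 16384.0};
  for (double d : datasets_kB)
    printf("%8.0f kB dataset: 1 in %3.0f blocks (%4.1f%%) may hit in L1\n",
           d, d / chunk_kB, 100.0 * chunk_kB / d);
  return 0;
}
```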

For the data in the plot and the included data files, I changed the per-thread-block dataset from 256 kB to 512 kB for exactly this reason. That reduces the effect but does not eliminate it, so you still should not use the first few values. Instead, use the values right before the dataset drops out of the L2 cache into memory. With 512 kB per thread block, I get 7 TB/s.

@te42kyfo (Collaborator)

The old parameters (256 kB) had been fine before, but they don't work as well with the larger L1 cache of the H100. The cache line replacement strategy might also have changed.
