What's the difference between cuda-l2-cache and gpu-cache benchmarks? #3
Comments
I have written cuda-l2-cache specifically to benchmark L2 cache bandwidth only. It simulates a scenario where data is read repeatedly by thread blocks on SMs all over the chip. Think of it as a grid-stride loop over some data volume, run by as many threads as can execute simultaneously (a 'wave'). Afterwards, 10000 more waves do the same thing, except that each time the distribution of thread blocks over the data is different. This benchmark was written for this paper (see Figure 2). The peculiar way the data is repeatedly read by different thread blocks is due to the A100's segmented L2 cache: if the same data were repeatedly read by the same thread block, the measured L2 capacity would appear larger, because cache lines would not have to be duplicated. With this scheme, the data does have to be duplicated, because the reads come from SMs attached to different L2 cache segments.

The gpu-cache benchmark is a general cache benchmark for both the L1 and L2 cache. Because each thread block reads the same data as all the others, the data never falls out of the L2 cache; even if the data volume exceeds the L2 capacity, different thread blocks still reuse it in the L2 cache.
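For illustration, here is a minimal CUDA sketch of the access pattern described above; it is not the actual gpu-benches code, and the kernel name, sizes, and launch dimensions are all assumptions. Each wave re-reads the same data volume with a grid-stride loop, and the block-to-data mapping is rotated every wave so that subsequent reads come from SMs attached to other L2 segments:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of the pattern described above, not the benchmark itself.
// Each "wave" re-reads the same data volume with a grid-stride loop; the
// block-to-data mapping is rotated every wave, so on a segmented L2 cache
// (as on the A100) the lines must be duplicated across segments.
__global__ void l2ReadKernel(const double* __restrict__ data,
                             double* __restrict__ sink,
                             size_t elems, int waves) {
    double sum = 0.0;
    for (int w = 0; w < waves; ++w) {
        // Shift the block-to-data mapping each wave so this wave's reads
        // come from SMs other than the ones that last touched the data.
        size_t shift = (size_t)w * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < elems; i += (size_t)gridDim.x * blockDim.x)
            sum += data[(i + shift) % elems];
    }
    if (sum == -1.0) sink[threadIdx.x] = sum;  // defeat dead-code elimination
}

int main() {
    const size_t elems = 4 * 1024 * 1024;  // 32 MB of doubles: fits in A100/H100 L2
    const int waves = 10000, block = 256;
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    const int grid = 4 * smCount;  // roughly one full 'wave' of thread blocks

    double *data, *sink;
    cudaMalloc(&data, elems * sizeof(double));
    cudaMalloc(&sink, block * sizeof(double));
    cudaMemset(data, 0, elems * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    l2ReadKernel<<<grid, block>>>(data, sink, elems, waves);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbRead = (double)elems * sizeof(double) * waves / 1e9;
    printf("L2 read bandwidth: %.1f GB/s\n", gbRead / (ms / 1e3));
    return 0;
}
```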
When I run gpu-l2-cache on an H100 PCIe, the bandwidth column shows strange values (L2 bandwidth should be around 7500 GB/s). Do I need to change the code?
Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers in the beginning: for the data in the plot and the included data, I changed the per-thread-block data set from 256 kB to 512 kB, exactly because of this effect. This reduces it but doesn't eliminate it, so you still should not use the first few values. Instead, use the values right before the data set drops out of the L2 cache into memory. With 512 kB per thread block, I get 7 TB/s.
The old parameters (256 kB) had been fine before, but they don't work as well with the H100's larger L1 cache. The cache line (CL) replacement strategy might also have changed.
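As a rough illustration of why the per-thread-block footprint matters, the following hypothetical helper (not part of the benchmark; the 512 kB value and the use of the shared-memory carve-out as a stand-in for the L1 size are assumptions) checks whether a chosen footprint could still be served from L1, which is what inflates the first measurements:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper, not part of the benchmark: checks whether a chosen
// per-thread-block footprint could still be served from L1. The CUDA runtime
// does not report the L1 size directly, so the maximum shared-memory
// carve-out per SM is used here as a rough lower bound.
int main() {
    const size_t perBlockBytes = 512 * 1024;  // 512 kB, the value suggested above
    int smemPerSm = 0;
    cudaDeviceGetAttribute(&smemPerSm,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
    if (perBlockBytes <= (size_t)smemPerSm)
        printf("warning: %zu kB per block may fit in L1 (>= %d kB per SM)\n",
               perBlockBytes / 1024, smemPerSm / 1024);
    else
        printf("%zu kB per block should exceed L1 (~%d kB per SM)\n",
               perBlockBytes / 1024, smemPerSm / 1024);
    return 0;
}
```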