-
Thank you for sharing!
-
Thanks for answering, arrufat. No, it's my bad, I should have provided some code to make it easier to understand, because what I did is really simple. This is what I changed in the cache:
diff --git a/dlib/cuda/cudnn_dlibapi.cpp b/dlib/cuda/cudnn_dlibapi.cpp
index 06badceb..ec940d41 100644
--- a/dlib/cuda/cudnn_dlibapi.cpp
+++ b/dlib/cuda/cudnn_dlibapi.cpp
@@ -793,12 +793,13 @@ namespace dlib
{
// Calling the cuDNN "find the best algorithm" functions is really slow. So we keep a
// cache that tells us what method was best for a particular configuration.
- thread_local std::map<std::tuple<int,int,int,int,long,long>,
+ thread_local std::map<std::tuple<int,int,int,int,long,long,long,long,long,long,long,long,int,int,int,int>,
std::tuple<int,int,int>> config_to_algo_cache;
// If we have already found good algorithms for this setting then just pull them from
// the cache.
- const auto cache_key = std::make_tuple(stride_y, stride_x, padding_y, padding_x, filters_nr, filters_nc);
+ const auto cache_key = std::make_tuple(stride_y, stride_x, padding_y, padding_x, data_num_samples, data_k, data_nr, data_nc,
+ filters_num_samples, filters_k, filters_nr, filters_nc, out_num_samples, out_k, out_nr, out_nc);
const auto iter = config_to_algo_cache.find(cache_key);
if (iter != config_to_algo_cache.end() && allow_cache_use_ == allow_cache_use::yes)
{
It is really short. If you copy that into a patch file and apply it with git apply, you can test it and see results similar to the table above (or so I hope).
-
My intention was to create a PR with this directly, but I ran into several problems along the way, so I decided to open a discussion first; depending on what comes out of it, I'll open the PR or not.
At the moment, dlib uses these values for the cache key: stride_y, stride_x, padding_y, padding_x, filters_nr, filters_nc. What I noticed while doing some tests is that execution time is lower when the cache is disabled. Disabling it is obviously not a good solution, because the cache is there to avoid performance issues when processing images of different sizes. So I thought about adding more parameters to the key, and that worked just as well as not using the cache at all.
I've been looking at the cudnn-frontend project, and although Nvidia uses the cuDNN Backend API there, they also have a cache, but with a lot of parameters (some of which I don't even know what they are). This kind of validates my previous thought. But there are two problems:
First, when testing on more models, I found that resnet18 and resnet34 run slower with this configuration, which is very strange. It seems that the algorithm chosen by cudnnFindConvolutionForwardAlgorithm is faster when running that layer alone, but slower when running the entire network. Here's a table with some tests done on my computer with a GTX 1650:
Comparative 1: how much slower the first execution is.
Comparative 2: how much faster it is on average (excluding the first execution).
Some models don't improve, which is fine. Others improve (some a lot, like googlenet, resnet50 and resnet152, and even yolov3 when training). But as you can see, resnet18 and resnet34 run slower, and I don't know why.
The second problem is that there will be a lot of cache misses if the size of the images varies. I've read that those algorithms tend to be more or less efficient depending on large differences in input size, output size, kernel size, and number of channels (and different combinations of each). So I thought about not recalculating the best algorithm every time some minor change in the parameters appears. To account for that, I tried taking only the most significant bit of each number and using that in the cache key, so the cache is only updated when a dimension crosses a power of 2. This worked just as well as my previous approach, but I can't verify that it works in all other cases or on other GPUs.
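For illustration, here is a minimal C++ sketch of that most-significant-bit bucketing, under my own assumptions: msb_bucket and make_bucketed_cache_key are hypothetical names, not anything in dlib, and the dimension parameters simply mirror the ones used in the diff above.

```cpp
#include <tuple>

// Hypothetical helper: round a dimension down to the largest power of two
// that does not exceed it, so nearby sizes share the same cache entry.
inline long msb_bucket(long v)
{
    if (v <= 0)
        return 0;
    long b = 1;
    while (b <= v / 2)
        b <<= 1;
    return b;
}

// Hypothetical cache key: strides and paddings stay exact (they are small and
// discrete), while tensor dimensions are bucketed so that minor size changes
// do not trigger another cudnnFindConvolutionForwardAlgorithm search.
inline auto make_bucketed_cache_key(
    int stride_y, int stride_x, int padding_y, int padding_x,
    long data_num_samples, long data_k, long data_nr, long data_nc,
    long filters_num_samples, long filters_k, long filters_nr, long filters_nc)
{
    return std::make_tuple(
        stride_y, stride_x, padding_y, padding_x,
        msb_bucket(data_num_samples), msb_bucket(data_k),
        msb_bucket(data_nr), msb_bucket(data_nc),
        msb_bucket(filters_num_samples), msb_bucket(filters_k),
        msb_bucket(filters_nr), msb_bucket(filters_nc));
}
```

With a key like this, an input going from 512×512 to 600×600 keeps hitting the same entry (bucket 512), while growing past 1024 triggers exactly one new algorithm search.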
So, in conclusion, I think we could make dlib faster by changing the cache, but I don't like the idea of those two really common classification models running slower, nor the idea of using something I can't prove works for all cases. What do you think?