-
Thank you for sharing!
-
Thanks for answering, arrufat. No, it's my bad, I should have provided some code to make it easier to understand, because what I did is really simple. This is what I changed in the cache:
diff --git a/dlib/cuda/cudnn_dlibapi.cpp b/dlib/cuda/cudnn_dlibapi.cpp
index 06badceb..ec940d41 100644
--- a/dlib/cuda/cudnn_dlibapi.cpp
+++ b/dlib/cuda/cudnn_dlibapi.cpp
@@ -793,12 +793,13 @@ namespace dlib
{
// Calling the cuDNN "find the best algorithm" functions is really slow. So we keep a
// cache that tells us what method was best for a particular configuration.
- thread_local std::map<std::tuple<int,int,int,int,long,long>,
+ thread_local std::map<std::tuple<int,int,int,int,long,long,long,long,long,long,long,long,int,int,int,int>,
std::tuple<int,int,int>> config_to_algo_cache;
// If we have already found good algorithms for this setting then just pull them from
// the cache.
- const auto cache_key = std::make_tuple(stride_y, stride_x, padding_y, padding_x, filters_nr, filters_nc);
+ const auto cache_key = std::make_tuple(stride_y, stride_x, padding_y, padding_x, data_num_samples, data_k, data_nr, data_nc,
+ filters_num_samples, filters_k, filters_nr, filters_nc, out_num_samples, out_k, out_nr, out_nc);
const auto iter = config_to_algo_cache.find(cache_key);
if (iter != config_to_algo_cache.end() && allow_cache_use_ == allow_cache_use::yes)
{
It is really short. If you copy that into a patch file and apply it with git apply, you can test it and see results similar to the table above (or so I hope).
-
My intention was to create a PR with this directly, but I ran into several problems along the way, so I decided to open a discussion first; depending on what comes out of it, I'll open the PR or not.
At the moment, dlib uses these values for the cache key: stride_y, stride_x, padding_y, padding_x, filters_nr, filters_nc. What I noticed while doing some tests is that execution time is lower when the cache is disabled. Disabling it is obviously not a good solution, because the cache is there to avoid performance issues when processing images of different sizes. So I thought about adding more parameters to the key, and that worked just as well as not using the cache at all.
I've been looking at the cudnn-frontend project, and although Nvidia uses the cuDNN Backend API there, they also have a cache, but with a lot of parameters (some of which I don't even know what they are). This kind of validates my previous thought. But there are two problems:
First, when testing on more models, I found that resnet18 and resnet34 run slower with this configuration, which is very strange. It seems that the algorithm chosen by cudnnFindConvolutionForwardAlgorithm is faster when running that layer alone, but slower when running the entire network. Here's a table with some tests done on my computer with a GTX 1650:
Comparative 1: how much slower the first execution is.
Comparative 2: how much faster it is on average (excluding the first execution).
Some models don't improve, which is fine. Others improve (some a lot, like googlenet, resnet50 and resnet152, and even yolov3 when training). But as you can see, resnet18 and resnet34 run slower, and I don't know why.
The second problem is that there will be a lot of cache misses if the size of the images varies. I've read that those algorithms tend to be more or less efficient depending on large differences in input size, output size, kernel size, and number of channels (and different combinations of each). So I thought about not recalculating the best algorithm every time some minor change in the parameters appears. To account for that, I tried taking only the most significant bit of each number and using that in the cache key, so the cache is only updated when a dimension crosses a power of 2. This worked just as well as my previous approach, but I can't verify that it works in all other cases or on other GPUs.
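For illustration, here is a minimal C++ sketch of that most-significant-bit bucketing, under my own assumptions: msb_bucket and make_bucketed_cache_key are hypothetical names, not anything in dlib, and the dimension parameters simply mirror the ones used in the diff above.

```cpp
#include <tuple>

// Hypothetical helper: round a dimension down to the largest power of two
// that does not exceed it, so nearby sizes share the same cache entry.
inline long msb_bucket(long v)
{
    if (v <= 0)
        return 0;
    long b = 1;
    while (b <= v / 2)
        b <<= 1;
    return b;
}

// Hypothetical cache key: strides and paddings stay exact (they are small and
// discrete), while tensor dimensions are bucketed so that minor size changes
// do not trigger another cudnnFindConvolutionForwardAlgorithm search.
inline auto make_bucketed_cache_key(
    int stride_y, int stride_x, int padding_y, int padding_x,
    long data_num_samples, long data_k, long data_nr, long data_nc,
    long filters_num_samples, long filters_k, long filters_nr, long filters_nc)
{
    return std::make_tuple(
        stride_y, stride_x, padding_y, padding_x,
        msb_bucket(data_num_samples), msb_bucket(data_k),
        msb_bucket(data_nr), msb_bucket(data_nc),
        msb_bucket(filters_num_samples), msb_bucket(filters_k),
        msb_bucket(filters_nr), msb_bucket(filters_nc));
}
```

With a key like this, an input going from 512×512 to 600×600 keeps hitting the same entry (bucket 512), while growing past 1024 triggers exactly one new algorithm search.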
So, in conclusion, I think we could make dlib faster by changing the cache, but I don't like the idea of those two really common classification models running slower, nor the idea of using something I can't prove works for all cases. What do you think?