Can we avoid allocating a significant amount of memory when executing in a multithreaded environment when precision is changed? #23484
Unanswered · tmbrye asked this question in Performance Q&A · 1 comment, 5 replies
-
Reply excerpt: "This allocation is per thread that calls …"
-
We have run the Java onnxruntime in a multi-threaded environment for quite a while and had not noticed memory issues until we started trying very large HuggingFace embedding models. We read the model in, cache it in memory, and then multiple threads on the system call the run method to score their inputs (a minimal sketch of this setup follows). This works as expected for models that are converted to ONNX without changing the model's precision.
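For reference, a minimal sketch of that setup using the onnxruntime Java API; the model path, input name, shapes, and thread count below are placeholders, not the actual application code:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedSessionScoring {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Load the model once; OrtSession.run is documented as thread-safe,
        // so a single cached session is shared by all scoring threads.
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
             OrtSession session = env.createSession("model.onnx", opts)) {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                pool.submit(() -> {
                    try {
                        // Placeholder input; a real embedding model would take
                        // tokenized text (e.g. input_ids / attention_mask).
                        float[][] data = new float[1][128];
                        try (OnnxTensor input = OnnxTensor.createTensor(env, data);
                             OrtSession.Result result = session.run(Map.of("input", input))) {
                            // consume result ...
                        }
                    } catch (OrtException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```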
We read the model into memory (in this case a 1.17 GB model) and it allocates the space it needs, which is around 8 GB. Then the threads execute the run method and memory may increase ever so slightly, but nothing out of the ordinary. However, when we convert this model using fp16 or int8 precision, the model file is obviously smaller, but upon execution it allocates the space it needs and then allocates a good amount more every time a new thread joins the process. This very quickly overwhelms the system with CPU and memory growth.
My question is: is it expected behavior that, when a model uses fp16 precision, each thread will essentially allocate another copy of the model? Are there any settings that might help mitigate this issue? Thanks in advance.
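A sketch of the session-level knobs that typically affect this kind of per-thread arena growth. The method names (addCPU, setMemoryPatternOptimization) are from the onnxruntime Java API; the run-config key comes from onnxruntime's run options config header, and whether RunOptions.addRunConfigEntry is exposed depends on the Java release, so verify against your version:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

public class LowMemorySessionConfig {
    // Open a session with the CPU arena disabled, so per-thread scratch
    // memory is freed after each run instead of being pooled per thread.
    public static OrtSession openSession(String modelPath) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        opts.addCPU(false); // CPU execution provider without an arena allocator
        // Memory-pattern planning preallocates from observed shapes; with
        // variable-length inputs it can pin large high-water-mark buffers.
        opts.setMemoryPatternOptimization(false);
        return env.createSession(modelPath, opts);
    }

    // Alternatively, keep the arena but ask it to shrink after each run.
    // Key from onnxruntime_run_options_config_keys.h; requires a Java
    // release that exposes RunOptions.addRunConfigEntry.
    public static OrtSession.RunOptions shrinkingRunOptions() throws OrtException {
        OrtSession.RunOptions runOpts = new OrtSession.RunOptions();
        runOpts.addRunConfigEntry("memory.enable_memory_arena_shrinkage", "cpu:0");
        return runOpts;
    }
}
```

The shrink option is applied per call via the session.run(inputs, runOptions) overload.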
Just some added information: we were on 1.13.1 and moved to 1.16.3, and the behavior is the same. We have an older glibc, so we aren't able to move to newer versions unless we compile them ourselves.