Replies: 1 comment
In the meantime I did some digging around and found that deserialization of large models and the graph transformations applied during load are the main causes of slow session creation. To address this I started working on a modification of ONNX Runtime which adds an optional memory residency API to the inference session and the execution provider (DirectML in my case). This API would help in cases where there is enough system RAM to hold all model data, but not enough VRAM to hold all of it at once. Using the API one can evict sessions from VRAM and move them to RAM while they are not actively used, and running a new inference brings them back. This is much faster than recreating sessions, since we do not need to deserialize and transform the graphs again. If everything goes well, multi-model AI pipelines such as image generation should see a very significant performance boost, as this could easily save 30-50% of the computation time, and the OS would no longer hang.

In DirectML I tried the memory residency API first, but the improvement was not as good as expected: sometimes memory would still fill up, causing Windows to hang for 2-5 seconds (the mouse freezes, videos and music stop, etc.). I am now planning to simply read back and then destroy dirty heaps when Evict is called, and load them back if the session is run again. Still, if you already have a better way to do this, let me know.
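For reference, this is roughly the D3D12 residency mechanism I mean (a minimal sketch; the function names and the `modelHeaps` collection are hypothetical - getting hold of the heaps that back one session's weights is exactly what the ONNX Runtime / DirectML modification would have to expose):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

// Hypothetical helper: mark the heaps backing one session's weights as evictable,
// so the OS can page them out of VRAM while the session is idle.
void EvictSessionHeaps(ID3D12Device* device, const std::vector<ComPtr<ID3D12Heap>>& modelHeaps)
{
  std::vector<ID3D12Pageable*> pageables;
  for (auto& heap : modelHeaps) pageables.push_back(heap.Get());
  device->Evict(static_cast<UINT>(pageables.size()), pageables.data());
}

// Hypothetical helper: bring the heaps back into VRAM before running inference again.
void RestoreSessionHeaps(ID3D12Device* device, const std::vector<ComPtr<ID3D12Heap>>& modelHeaps)
{
  std::vector<ID3D12Pageable*> pageables;
  for (auto& heap : modelHeaps) pageables.push_back(heap.Get());
  device->MakeResident(static_cast<UINT>(pageables.size()), pageables.data());
}
```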
I am working on a C++ ONNX Runtime app called Unpaint that runs Stable Diffusion with DirectML - currently I am integrating SD XL into it. As you might know, running SD requires running inference on a number of different models (text encoder, UNet, VAE decoder) to generate each image.
In simpler cases, or if there is plenty of VRAM (e.g. 24GB or more), this works fine. But with even mildly complicated pipelines, such as SD 1.5 + ControlNet or SD XL on its own on a GPU with 12GB of VRAM, trouble begins: the system runs out of VRAM, stutters very badly as shared GPU memory kicks in, and generation slows down a lot.
In my opinion the answer is simple: we should just swap out the model weights not used by the current inference. Each model fits in VRAM on its own, we only need one at a time, all of them together fit in system RAM, and RAM and VRAM can transfer many tens of GB per second, so this should work just fine. But alas, it does not.
The best I can do is create an inference session, run it once, destroy it, and repeat. While this prevents overloading the VRAM, session creation is extremely slow for the UNet - it takes about as long as the rest of the image generation process. So it is not really faster; the only benefit is that you can use the computer while it is running, instead of having it hang half the time.
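To make it concrete, the workaround looks roughly like this (a sketch assuming the ONNX Runtime C++ API with the DirectML execution provider; input/output binding is omitted and the function name and model path are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

void RunUnetOnce(Ort::Env& env)
{
  Ort::SessionOptions options;
  options.DisableMemPattern();              // required when using the DirectML EP
  options.SetExecutionMode(ORT_SEQUENTIAL); // required when using the DirectML EP
  Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(options, /*device_id*/ 0));

  // This is the expensive part: the ~5GB model is deserialized and the graph is
  // transformed every single time, which takes about as long as the inference itself.
  Ort::Session session{env, ORT_TSTR("unet.onnx"), options};

  // ... create Ort::Value inputs and call session.Run(...) here ...

  // The session is destroyed when it goes out of scope, freeing its VRAM
  // before the next model (text encoder / VAE) gets its own session.
}
```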
So my question is: how can this be done better? The model is ~5GB, RAM can do 20GB/s+ copies as far as I know, and PCIe is fast as well, so why does this take 10 seconds? I think it should take less than one.
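Back of the envelope: copying 5GB at 20GB/s is about 0.25 seconds per direction, so even a full round trip between RAM and VRAM should finish well under a second - the remaining ~10 seconds presumably go into deserialization and graph optimization rather than the copy itself.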
PS: If you have pointers to other options for reducing memory usage or optimizing models for load time - I am already using Olive - that would help too. Honestly, the general documentation for ONNX Runtime and tools like Olive seems to be either sparse or well hidden from search engines. While I can find examples for running one network or another, there is very little documentation about performance optimization or about how to use many of the APIs, which is honestly the most critical part for many applications.
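To illustrate what I mean by load-time optimization: I assume one could serialize the optimized graph once and then reload it with graph optimizations disabled, roughly like this (a sketch; the function name and file names are placeholders), but whether this is the intended approach with DirectML, or whether Olive already covers it, is exactly the kind of thing I could not find documented.

```cpp
#include <onnxruntime_cxx_api.h>

Ort::Session CreateSessionFast(Ort::Env& env)
{
  // One-time (offline) step: run the graph transforms once and save the result.
  {
    Ort::SessionOptions options;
    options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    options.SetOptimizedModelFilePath(ORT_TSTR("unet_optimized.onnx"));
    Ort::Session warmup{env, ORT_TSTR("unet.onnx"), options};
  }

  // Every later launch: load the pre-optimized model and skip the transforms.
  Ort::SessionOptions options;
  options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_DISABLE_ALL);
  return Ort::Session{env, ORT_TSTR("unet_optimized.onnx"), options};
}
```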