Replies: 8 comments 3 replies
-
Thanks for this well-written report. The C/C++ runtime is not an internal dependency of ORT. Every application needs a runtime that allows it to run, and applications typically include this runtime when they're shipped/packaged. ORT is a low-level library; it's not intended for average end users to download. We assume that there will be applications that provide useful functionality using ORT and ship ORT along with them. It is thus incumbent on those applications to ensure all required dependencies are shipped along with ORT. These dependencies have been clearly identified and documented here. Having said that, our error messages can definitely be improved. We'll add this to our backlog.
-
Early in development we tried the CUDA package, but we discarded it because in the real world it's impossible to ensure the right drivers and dependencies are properly installed, so we are using DirectML as the safest option, and I think it's this one that requires vcredist and other dependencies. @pranavsharma you're right that any required dependency needs to be installed by the application package, but that only solves some of the problems. The other problem that has been identified is ORT loading the dependencies from the wrong folder (System32). I am aware that there are ways to tell a process to explicitly load assemblies from a specific directory, and I think this should be done by ORT, not by the end developer. We agree that error messages need to be improved to help diagnose and narrow down the source of any problem. This is especially important when the error happens on a client's machine and we only get the error via telemetry logs.
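To illustrate the "load from a specific directory" point: on Windows a process can pin native-library resolution to an explicit directory so that stale copies in System32 are not picked up first. A minimal sketch in Python (the same mechanism exists in Win32 as `AddDllDirectory`/`SetDefaultDllDirectories`); the `runtimes` folder name below is only an example layout, not something ORT defines:

```python
import os
import sys

def pin_native_library_directory(path):
    """Restrict native-library (DLL) resolution to an explicit directory.

    On Windows (Python 3.8+), os.add_dll_directory() adds `path` to the
    search path used when dependent DLLs are resolved, so the copies
    shipped next to the application win over ones in System32.
    On POSIX systems this is a no-op: the equivalent mechanism there is
    an RPATH of $ORIGIN baked into the shared library, or LD_LIBRARY_PATH.
    """
    if sys.platform == "win32" and hasattr(os, "add_dll_directory"):
        return os.add_dll_directory(os.path.abspath(path))
    return None

# Example: pin the app's own "runtimes" folder before loading onnxruntime.
# handle = pin_native_library_directory("runtimes")  # hypothetical layout
```

The point of the discussion is that ORT itself could do something equivalent internally when it resolves its provider DLLs, instead of leaving it to each application.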
-
We are also desperately trying to be able to use ONNX Runtime in a medical application. We see huge obstacles in knowing which files need to be distributed and where to put them on each end-user machine, so as to make sure our versions are the ones loaded even if other versions are reachable in the path on the end-user machine.

For the first part, we need file lists of which files to include depending on the platform and which providers are included. As we don't know what hardware (GPU brand, etc.) the end user has, we have to ship "the works". On Windows it is easiest to always use DirectML, but as CUDA is far faster (last we measured, on 1.13) we want to ship that too. On Linux the situation is worse, as we have to include separate providers for each GPU make.

Referring to a page where you're supposed to run an obscure (to a native C++ programmer) dotnet command that stores an unknown set of files in an unknown location is not helpful at all when we have to provide a consistent file set to end users. So far I have not found any ONNX Runtime binaries that include all provider dlls in the same package, and I'm unsure whether the onnxruntime.dll/.so in the provided single-GPU-provider packages are exchangeable or only support their respective providers, which would be useless when we need to ship multiple GPU providers.
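One mitigation until official per-provider file lists exist: verify at build or startup time that every native library you believe is required actually sits next to the application. A sketch, where the manifest contents are an assumption you would maintain yourself against the ORT/cuDNN versions you ship, not an official list:

```python
from pathlib import Path

# Hypothetical per-platform manifest; maintain this against the ORT/cuDNN
# versions you actually ship -- it is NOT an official file list.
REQUIRED_LIBS = {
    "win32": ["onnxruntime.dll"],
    "linux": ["libonnxruntime.so"],
}

def missing_native_libs(app_dir, platform_key):
    """Return the required native libraries not present in app_dir."""
    root = Path(app_dir)
    return [name for name in REQUIRED_LIBS.get(platform_key, [])
            if not (root / name).is_file()]
```

Running this as an installer post-step (or a CI packaging test) at least turns "an unknown set of files in an unknown location" into an explicit, diffable check.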
-
As for the second part, where applications may load dlls from the wrong place, picking up old ones that came with the OS, with Windows Update, or with some other random application using some random version of ONNX Runtime: there are problems related to the nested loading of provider libraries and their dependent libraries that I don't know are solvable at all, unless ONNX Runtime's cmake system allows setting RPATH (on Linux) from the build command line, or similar.
-
Finally, on the side of error-message improvements: it seems that if you append a number of provider settings, for instance to get the CUDA provider if there is an NVIDIA GPU, DirectML if there is an AMD one, or CPU if neither, you can't see what you actually got. We're really worried that our customers may miss copying some of the cuDNN-related libraries and run on CPU without knowing it (except for complaining about slowness), so we really want to be able to know where it runs and why it couldn't run on a more preferred provider.
-
I think there's a misunderstanding, at least on my side. I am not requesting that OnnxRuntime install all its dependencies for me; what I think is OnnxRuntime's responsibility is to give some way to get clear diagnostics. Right now, the current logic for running OnnxRuntime looks like this:

```csharp
try { TryInitializeCuda(); }
catch
{
    try { TryInitializeDirectML(); } // if CUDA fails, fall back to DirectML
    catch
    {
        try { TryInitializeCPU(); } // if DirectML fails, fall back to CPU
        catch
        {
            Error("No ORT available");
        }
    }
}
```

My argument against this approach is that the initialization of the fallback ORTs relies on the previous ORTs crashing, potentially leaving unmanaged garbage behind or leaving the system in an unstable state, which might affect or compromise the execution of the ORT that is eventually initialized successfully. Also, libraries that have been partially loaded cannot be unloaded. And, as @BengtGustafsson stated, it may fall back to CPU in cases where it would have been possible to fix the problem and run on CUDA or DML.

A better approach would be this:

```csharp
var cudaState = Orts.GetCudaORTDiagnostics();
if (cudaState == ReadyToRun)
{
    InitializeCuda();
    return;
}
else if (cudaState != Unavailable) // diagnose this ORT
{
    ApplicationInsights.RemoteLog(cudaState); // send telemetry so we can resolve the issue with the client by PHONE CALL
    // diagnostics should give enough information to help resolve the problem:
    // - incompatible graphics card?
    // - missing drivers?
    // - wrong CUDA driver version?
    // - invalid context? (missing files, wrong dll resolve paths, wrong file versions, who knows)
}

// fall back to DirectML if CUDA is unavailable
var dmlState = Orts.GetDirectMLDiagnostics();
if (dmlState == ReadyToRun)
{
    InitializeDML();
    return;
}
else if (dmlState != Unavailable) // diagnose this ORT
{
    ApplicationInsights.RemoteLog(dmlState); // send telemetry so we can resolve the issue with the client by PHONE CALL
    // diagnostics should give enough information to help resolve the problem:
    // - incompatible graphics card?
    // - missing VC Redistributables?
    // - invalid context? (missing files, wrong dll resolve paths, wrong file versions, who knows)
}

// fall back to CPU if DML is unavailable
InitializeCPU();
```

What's really important is that any diagnostics state given by a given ORT should be divided into three groups:
Notice that what prevents us from doing that diagnostics process ourselves is that we are, after all, end users of OnnxRuntime, and we shouldn't have to know the requirements of a given ORT in detail. And this doesn't mean there's nothing more that can be done from OnnxRuntime's side to mitigate the problem; I see DirectML as a fail-safe, and it's important that it's able to run in most circumstances, and for that, requiring VCRedist to be installed is a hassle. It could be great to have a
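The diagnose-before-initialize flow proposed above can be expressed as a small state machine. A Python sketch, where the states and the probe functions are hypothetical (nothing like `GetCudaORTDiagnostics` exists in ORT today, which is exactly the request):

```python
from enum import Enum

class ProviderState(Enum):
    READY_TO_RUN = "ready"        # all prerequisites satisfied
    DIAGNOSABLE = "diagnosable"   # present but broken: log details, maybe fixable
    UNAVAILABLE = "unavailable"   # e.g. no such GPU at all: silently skip

def select_provider(probes):
    """Walk providers in preference order.

    `probes` maps provider name -> zero-arg callable returning
    (ProviderState, detail). Returns (chosen_name, telemetry), where
    telemetry collects the diagnostic detail of every provider that was
    present but unusable (to be sent via e.g. Application Insights).
    """
    telemetry = []
    for name, probe in probes.items():
        state, detail = probe()
        if state is ProviderState.READY_TO_RUN:
            return name, telemetry
        if state is ProviderState.DIAGNOSABLE:
            telemetry.append((name, detail))
    return "CPU", telemetry  # last-resort fallback

# Example: CUDA is present but misconfigured, DirectML is usable.
probes = {
    "CUDA": lambda: (ProviderState.DIAGNOSABLE, "wrong CUDA driver version"),
    "DirectML": lambda: (ProviderState.READY_TO_RUN, None),
}
chosen, telemetry = select_provider(probes)
# chosen == "DirectML"; telemetry records why CUDA was skipped.
```

The key property is that no provider is ever *attempted* and left half-initialized: a provider is only initialized once its probe says it is ready.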
-
I think we have similar needs here. Maybe I overreached when asking for file lists, but experience with cuDNN was that there are seemingly unrelated dlls that are nevertheless needed for operation, that it is non-trivial to figure out which, and that the set changed with the cuDNN version. That said, such lists would probably best be provided by NVIDIA directly. As for the ORT provider dlls, it seems fairly easy to figure out which files are needed. As for vcredist, this is unavoidable, I think. The program can't even start if it is missing (these dlls invariably load implicitly), so there is no way for the program to report that they are missing.
-
Recently I've been coming to a hospital two or more days every week, so I'd be very glad to know ONNX Runtime can provide help there. I'm not sure what kind of diagnostics could be added to the core onnxruntime.dll; would a separate diagnostic tool be helpful? Statically linking against the VC runtimes only works in some cases. It would be problematic if onnxruntime also needs to load a custom op or a dynamically loaded EP (e.g. CUDA), because when you statically link the VC runtime, every DLL has its own heap, and there are more restrictions on passing C++ objects across DLL boundaries. Also, nowadays the VC runtime is not a single piece; see https://devblogs.microsoft.com/cppblog/introducing-the-universal-crt/ . So if your target environments are Windows 10 and above, I think it should be easy to do app-local deployment as mentioned in the article, and I think that is preferable to static linking.
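App-local deployment of the VC runtime boils down to copying the redistributable DLLs next to your executable instead of requiring a machine-wide VCRedist install. A hedged sketch: the DLL names listed are the common VS 2015+ ones, but the exact set depends on your toolset and should be taken from your own Visual Studio Redist folder, not from this example:

```python
import shutil
from pathlib import Path

# Common VC++ 2015+ redistributable DLLs; verify the set against your
# own toolset's Redist folder rather than trusting this list.
APP_LOCAL_CRT = ["vcruntime140.dll", "vcruntime140_1.dll", "msvcp140.dll"]

def deploy_app_local_crt(redist_dir, app_dir):
    """Copy the CRT DLLs next to the application (app-local deployment).

    Returns the list of DLLs that were found and copied, so the build
    can fail loudly if anything in APP_LOCAL_CRT is missing.
    """
    copied = []
    for name in APP_LOCAL_CRT:
        src = Path(redist_dir) / name
        if src.is_file():
            shutil.copy2(src, Path(app_dir) / name)
            copied.append(name)
    return copied
```

Done as an installer/packaging step, this sidesteps the "please ask hospital IT to install VCRedist" conversation entirely for Windows 10+ targets.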
-
This error will be familiar to some people here:
We thought we had this issue resolved by installing VCRedist, but we're still getting this error on some clients' machines, and the issue looks similar to #13744.
As far as I can see, this is a recurring problem that keeps popping up time and again: #116, #5449, #9260, #11230, #13744
At least two possible causes for this error have been identified:
The problem is that, after all, this is an ONNX Runtime internal dependency, and good practice says that whoever has a dependency is responsible for correctly loading it, or at least for giving meaningful diagnostic information to help the end developer fix the problem.
After reading many of the threads, and from other conversations I have had with developers here and in other MS departments, I have the impression many developers have a hard time understanding that OnnxRuntime is not limited to servers and needs to run on an average, clueless end user's machine. Asking users to install obscure dependencies is already extremely painful, and we have to go to great lengths to include them in our installer as seamlessly as possible.
Furthermore, OnnxRuntime is being used for medical applications that need to be installed on computers within hospital environments with extremely tight security. I can't share them, but I would love to share some of the chats I had with hospital IT personnel when I asked them to install VC redist, or even just to send me additional log files.
But the biggest problem is that if that exception happens on a client's machine, we only get notice of it through Application Insights logging, so we have no way to diagnose why it happened, and giving solutions that require full access to the affected machine is not useful at all.
So I would humbly ask the OnnxRuntime developers to spend some time improving the developer and end-user experience around this issue in these areas: