Camera take_picture gets stuck sometimes #171
Comments
The out-of-memory error is actually due to exhaustion of fences; the Vulkan message is a bit misleading. The driver is simply saying it cannot create more fences. SAPIEN creates a fence for each camera to synchronize its work, so creating many cameras across many processes seems to hit the driver-defined, system-wide limit on the number of fences that can be created. As for the freeze, I cannot really tell what is causing it. I may be able to take a look if you can reproduce it with pure SAPIEN code. SimplerEnv seems to be built on top of ManiSkill2, which is two layers of encapsulation away from SAPIEN, and many things could go wrong.
@XYZ-99 since you are using SIMPLER: I am still migrating it over to ManiSkill 3, so you can't quite use the new system yet. Hopefully I will have some example SIMPLER environments ported over to the GPU sim + rendering system for faster batched evaluation; currently only some of the CPU side is done.
@fbxiang Thank you for your timely response!
I actually partially migrated SIMPLER just yesterday. See this for how to access the parallelized SIMPLER environments (only the Bridge dataset at the moment) and how to run inference fast on them: https://maniskill.readthedocs.io/en/latest/tasks/digital_twins/index.html#bridgedata-v2-evaluation It runs about 60-100x faster than real-world evaluation speed and 10x faster than the CPU sim (more if you have a good GPU).
@StoneT2000 Thank you for your reply!
Google robot is complicated because it originally uses ruckig for the controller. I don't know whether a GPU-parallelized version of the Google robot controller exists, so someone would need to write CUDA / PyTorch code to imitate that controller's behavior. This is the hardest part and the reason I have avoided adding Google robot tasks for now. It's possible but not trivial, so the timeline is uncertain. However, it may also be possible to approximate the controller by heavily tuning existing ManiSkill controllers. I haven't investigated this deeply though. It is currently easier to add new environments for robots that use existing controllers (like pd joint delta pos, or IK-based ones that control the end effector, like the robot in the Bridge dataset).
Got it! It sounds non-trivial to take advantage of the GPU parallelization for Google robot then.
@fbxiang Actually, deep down the camera is stuck at this line:
System:
Describe the bug
`multiprocessing` jobs, I have seen in my own log files a traceback which might be related to this problem:
I'm not sure if they are related.
P.S. 1. I am 80% sure that my GPUs didn't go out of memory even though it says `ErrorOutOfHostMemory`.
P.S. 2. This error has only been seen a few times when I launch multiprocessing jobs. In most cases, multiprocessing also just gets stuck without throwing an error.
P.S. 3. When I run with a single process, the bug can also occur (but never with this traceback; it's just frozen forever), but in that case it's very unlikely to go out of memory.
To Reproduce
`take_picture` ends up in a deadlock when it tries to acquire some kind of resource from the GPU?
Expected behavior
`gym.make` shouldn't get stuck, or at least it should throw an error.
Screenshots
No.
Additional context
I tried on H100 and L40. This bug can occur on both types.
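Until the root cause is found, one way to get the "at least throw an error" behavior described under Expected behavior is to run the reproduction in a child interpreter guarded by a watchdog. The sketch below is only an illustration built on the Python standard library (`subprocess` and `faulthandler`); `run_with_watchdog` is a hypothetical helper name, it is not part of SAPIEN or ManiSkill, and it assumes the reproduction can be passed in as a standalone script string. If the child is still running at the deadline, `faulthandler` prints the Python stack of every thread and exits, so the hang turns into an error that also shows where Python was stuck.

```python
import subprocess
import sys

# Prelude prepended to the child script: if the child is still running after
# `t` seconds, faulthandler dumps every thread's Python stack to stderr and
# terminates the process with a non-zero exit code.
_PRELUDE = "import faulthandler; faulthandler.dump_traceback_later({t}, exit=True)\n"


def run_with_watchdog(script: str, timeout_s: float) -> str:
    """Run `script` in a fresh interpreter; raise instead of hanging forever.

    Hypothetical helper for illustration only.
    """
    child_code = _PRELUDE.format(t=timeout_s) + script
    backstop_s = timeout_s + 5.0  # in case the in-process watchdog never fires
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=backstop_s,  # subprocess kills the child on expiry
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"child did not exit within {backstop_s:.0f}s") from exc
    if proc.returncode != 0:
        # stderr carries either the child's own traceback or the watchdog dump.
        raise RuntimeError(
            f"child failed or hung (exit code {proc.returncode}):\n{proc.stderr}"
        )
    return proc.stdout
```

Passing the script that calls `gym.make` and `take_picture` as `script` would then surface a freeze as a `RuntimeError` carrying the stack dump. Note that the dump only covers Python frames, so a hang inside native rendering code shows up as the last Python line that called into it.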