[BUG] XLA Segmentation Fault #283
Comments
Do you mean that if this line is removed, the crash goes away?
Initially I thought so, but then it looks like it's flaky. But I guess you have found the issue?
Yep, #284 should fix it.
This closes #283. The `XlaSend` call requires `envpool` to make a copy of the `action` so that the XLA runtime cannot recycle `action` before `envpool` finishes using it. Originally, I used `cudaMemcpy` to make sure the copy finished synchronously. However, that seems to cause the problem reported in issue #283. Here, I replace the original `cudaMemcpy` call with the async version, followed by an explicit `streamSynchronize`. It is not clear how `cudaMemcpy` on the default stream in a custom call interacts with the stream managed by pjrt. However, from the code [here](https://github.com/tensorflow/tensorflow/blob/0d2d79e84c9bdf71c737ad17a7b1dc04d9efc24f/tensorflow/compiler/xla/g3doc/custom_call.md), I hypothesize that an explicit stream synchronization in the custom call is safe.
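To make the reason for the copy concrete, here is a minimal host-only sketch in plain Python. It is an analogy, not the `envpool` implementation: it uses no CUDA, XLA, or `envpool` APIs, and `RecyclingRuntime`, `Consumer`, and `run` are hypothetical names invented for illustration. It only shows what can go wrong when a consumer keeps a reference to a buffer that its owner is free to reuse, and how copying before returning avoids that.

```python
# Host-only analogy. Nothing here is envpool, XLA, or CUDA code;
# every class and function name below is hypothetical.


class RecyclingRuntime:
    """Stand-in for a runtime that reuses one buffer across calls."""

    def __init__(self, size):
        self._buf = bytearray(size)

    def call(self, payload, callback):
        # Write the new payload into the shared buffer in place.
        for i, b in enumerate(payload):
            self._buf[i] = b
        # Hand the callee a view of the buffer, not a copy.
        callback(memoryview(self._buf)[: len(payload)])
        # Once the callback returns, the runtime may overwrite the buffer.


class Consumer:
    """Stand-in for a component that uses the action after the call returns."""

    def __init__(self, copy):
        self.copy = copy
        self.pending = []

    def send(self, action):
        # With copy=True the data is duplicated before control returns,
        # so later reuse of the runtime's buffer cannot affect it.
        self.pending.append(bytes(action) if self.copy else action)


def run(copy):
    runtime = RecyclingRuntime(2)
    consumer = Consumer(copy)
    runtime.call(b"\x01\x02", consumer.send)  # first action
    runtime.call(b"\x09\x09", consumer.send)  # buffer is recycled here
    return bytes(consumer.pending[0])         # what the first action reads as now


print(run(copy=True))   # first action is intact
print(run(copy=False))  # first action was overwritten by the second
```

In the real fix the copy happens on a device stream rather than on the host, so the sketch says nothing about the stream-ordering question discussed above; it only illustrates why the copy has to be complete before the call returns.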
@mavenlin Hmm, I tried it on my side but the issue seems to persist. I will take a closer look at the setup on my side.
I tested the wheel from here. I can run your above code without an issue.
Yeah, it seems to work. I was experimenting with PDM and that seems to have messed up my pip installation somehow. Many thanks for fixing this! It would be super cool if there were a new release on PyPI.
Will do this weekend, sorry for the delay.
Done, `pip install envpool` will now use 0.8.4.
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
The following code using the XLA interface crashes when running on the GPU.
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
System info
Describe the characteristic of your environment:
JAX 0.4.10.
Additional context
I ran under gdb, this is the backtrace
Reason and Possible fixes
If you know or suspect the reason for this bug, paste the code lines and suggest modifications.
Checklist