Possible deadlock in tensorstore #206
Comments
The stack trace indicates that your process is stuck waiting for the async tensorstore operation to complete. Can you get a backtrace of the other (non-Python) threads? The only thing that stands out so far is that you have concurrency set to 1 with 0 bytes for the cache pool. Edit: I tried to set up a long-running benchmark-style test of this and couldn't get it to trigger. Additional debugging is required to determine what's going on here.
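For reference, those settings correspond to context resources along these lines (a reconstruction of what this reply describes, not the reporter's exact configuration):

```python
# What "concurrency set to 1 with 0 bytes for the cache pool" maps to in a
# TensorStore context spec (reconstructed from this reply, not the reporter's file).
context_json = {
    "data_copy_concurrency": {"limit": 1},   # a single data-copy thread
    "cache_pool": {"total_bytes_limit": 0},  # chunk cache effectively disabled
}
```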
Hey, I've also been experiencing this issue when forking while accessing a tensorstore dataset, with the same infinite hang. Would love an update!
Forking is known not to work because TensorStore uses multiple threads internally. That is unfortunately not something that can be fixed. Instead, you need to ensure that forking happens before any threads are started, i.e. before doing anything with TensorStore.
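A minimal sketch of that pattern, assuming a multiprocessing-based setup (the zarr/memory spec and names below are illustrative, not from this thread): use the "spawn" start method, or fork before any TensorStore work, so worker processes never inherit TensorStore's internal threads, and open the store only inside each worker.

```python
# Hedged sketch: start workers with "spawn" and touch TensorStore only inside
# the child processes, so no process ever forks after TensorStore's threads exist.
# The zarr/memory spec is a stand-in for the real volume.
import multiprocessing as mp

def worker(i: int) -> int:
    import tensorstore as ts  # imported/opened only after the worker process starts
    store = ts.open(
        {"driver": "zarr", "kvstore": {"driver": "memory"}},
        create=True, dtype=ts.int32, shape=[100],
    ).result()
    store[i].write(i).result()
    return int(store[i].read().result())

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # avoids fork() after threads have been started
    with ctx.Pool(2) as pool:
        print(pool.map(worker, range(4)))
```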
Ah ok! That makes sense, thanks for the quick reply!
I've been encountering deadlocks when using tensorstore. I'm posting this issue somewhat reluctantly because I'm not 100% sure that tensorstore is to blame. If you have any thoughts or comments, let me know.
(BTW, I am using Linux, Python 3.12.6, and tensorstore 0.1.67. I see that the current version is 0.1.69, so I'll try upgrading.)

In my particular use case, I'm exporting a large array from a bespoke database into a sharded precomputed volume. I'm using a cluster, but I'm careful to make sure that my workers' tasks are aligned to the shard shape. In addition to writing the shards, I do occasionally have to read from the volume.

After running for a few hours, my code deadlocked. After inspecting all thread stacks for all Python workers, only one appeared problematic: it was stuck in a tensorstore function. (All the other threads were just sitting in their base worker event loop, waiting for new tasks.)

The particular line of code it was stuck on is shown below; it happens to be reading:
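(A hedged sketch of that kind of blocking read, with illustrative slice bounds and names rather than the actual line from the workload:)

```python
# Hedged sketch of a typical synchronous read via the TensorStore Python API;
# the slice bounds and names are illustrative, not the actual code.
import tensorstore as ts

def read_block(store: ts.TensorStore, z0: int, z1: int, y0: int, y1: int, x0: int, x1: int):
    # .read() starts an asynchronous read and returns a Future; .result() blocks
    # the calling thread until the read completes, which is the kind of wait the
    # thread stack showed.
    return store[z0:z1, y0:y1, x0:x1].read().result()
```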
...where store had been previously initialized using the following spec and context configuration:

context and spec
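A hedged sketch of what such an open call, spec, and context might look like for a sharded precomputed volume; the path and scale index are placeholders, and the cache-pool and concurrency limits simply mirror the values mentioned in the earlier reply:

```python
# Hedged sketch only: the real spec/context were posted in the collapsed
# "context and spec" section. The path and scale_index are placeholders; the
# cache_pool and data_copy_concurrency limits mirror the earlier reply.
import tensorstore as ts

context = ts.Context({
    "cache_pool": {"total_bytes_limit": 0},   # no in-memory chunk cache
    "data_copy_concurrency": {"limit": 1},    # single data-copy thread
})

spec = {
    "driver": "neuroglancer_precomputed",     # sharded "precomputed" volume
    "kvstore": {"driver": "file", "path": "/path/to/volume/"},
    "scale_index": 0,
}

store = ts.open(spec, context=context, read=True, write=True).result()
```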
To see if I could drill down a bit more, I attached to the running process with gdb and obtained the backtrace for the relevant thread, shown below. This seems to indicate that it's stuck in GetResult(), but I can't say much more than that.

gdb backtrace