Integration with object_store crate #3
Comments
After a deeper look into the library than my original glance afforded, I agree that object_store seems to include everything that an rfsspec would need. I'll see what it takes to integrate (help appreciated!). Critically, we will want to allow for the full set of credential options currently supported by s3fs, gcsfs and adlfs. Of course, HTTP is the easiest. Some thoughts:
I believe object_store is a relatively young library (and I haven't used it much myself), so it probably makes sense to start an issue over there (in the arrow-rs repo) to ask some of your questions. I've been wondering whether object_store considers it in-scope to include more filesystem APIs like fsspec, so I'll write up an issue (hopefully today) and cc you.
Yes! It's set up to be able to compile to wasm: https://github.com/apache/arrow-rs/tree/master/object_store#support-for-wasm32-unknown-unknown-target. As noted there, cloud integrations are currently turned off in wasm; not sure what the underlying complications are for those. See also this PR: apache/arrow-rs#2896
👋 object_store maintainer here
The object_store crate is focused on the APIs that object stores can efficiently provide, on the basis that this is what 99% of workloads actually need. Functionality such as directories, random access reads, etc. would therefore be considered out of scope. That being said, I don't see a reason why object_store couldn't be used as the basis for an fsspec-style implementation, with the unavoidable caveat that treating object stores as a filesystem requires prefetching heuristics and generally does not yield the best experience. See apache/datafusion#2205 (comment) and apache/arrow-rs#1473 for more context on this if you're interested.
I seem to remember some limitation of tokio's networking support. It was something at that level, as opposed to something inherent to object_store itself.
... as I was in the midst of writing an
I 100% agree that this is very tricky (I'll save those issues for later reading!). In my eyes, the style of fsspec is to expose this to users and let them choose how they want to handle it. fsspec in Python includes a variety of caching mechanisms that the user can add as they wish: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/caching.py
I'm particularly interested myself in random access reads with an underlying block cache. I think something like fsspec in Rust would be useful, and building on top of object_store would certainly make things easier.
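To make the "random access reads with an underlying block cache" idea concrete, here is a minimal std-only sketch. Everything in it is hypothetical and for illustration only: `RangeRead`, `MemoryObject` and `BlockCachedReader` are made-up names, not part of object_store or fsspec, and the backend is a synchronous in-memory stand-in rather than a real object store client.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for a remote object that can serve byte ranges.
pub trait RangeRead {
    /// Total object length in bytes.
    fn len(&self) -> usize;
    /// Fetch the bytes in `range` (assumed to be within bounds).
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory backend that counts how many "requests" were made.
pub struct MemoryObject {
    pub data: Vec<u8>,
    pub fetches: usize,
}

impl RangeRead for MemoryObject {
    fn len(&self) -> usize {
        self.data.len()
    }
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8> {
        self.fetches += 1;
        self.data[range].to_vec()
    }
}

/// Serves arbitrary reads from fixed-size blocks, fetching each block at most once.
/// The cache is unbounded here; a real one would need an eviction policy.
pub struct BlockCachedReader<R: RangeRead> {
    pub inner: R,
    block_size: usize,
    blocks: HashMap<usize, Vec<u8>>,
}

impl<R: RangeRead> BlockCachedReader<R> {
    pub fn new(inner: R, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self { inner, block_size, blocks: HashMap::new() }
    }

    /// Read `range`, clamped to the object length.
    pub fn read(&mut self, range: Range<usize>) -> Vec<u8> {
        let total = self.inner.len();
        let end = range.end.min(total);
        let start = range.start.min(end);
        let mut out = Vec::with_capacity(end - start);
        if start == end {
            return out;
        }
        let first = start / self.block_size;
        let last = (end - 1) / self.block_size;
        for idx in first..=last {
            let block_start = idx * self.block_size;
            let block_end = (block_start + self.block_size).min(total);
            if !self.blocks.contains_key(&idx) {
                // Cache miss: fetch the whole block in one request.
                let bytes = self.inner.read_range(block_start..block_end);
                self.blocks.insert(idx, bytes);
            }
            let block = &self.blocks[&idx];
            // Copy only the part of this block that overlaps the request.
            let lo = start.max(block_start) - block_start;
            let hi = end.min(block_end) - block_start;
            out.extend_from_slice(&block[lo..hi]);
        }
        out
    }
}
```

With a 16-byte block size, reading bytes 10..20 touches two blocks, and a later read of 12..30 is served entirely from the cache. Whether that trade-off pays off depends on the access pattern, which is the caveat raised above about workload-agnostic heuristics.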
@martindurant what are your plans for this repo? Do you want this to only be a Python integration? Are you interested in a public Rust filesystem API that is also integrated into Python? If you're focused mostly on speeding up fsspec in Python, a Rust filesystem API might be out of scope?
The TL;DR is that workload-agnostic caching layers don't perform very well: Databricks built their own integrated S3 reader for Spark, and the Hadoop ecosystem is working on adding vectored IO that maps better to the underlying object store requests.
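As a rough illustration of why vectored IO maps better onto object store requests: if the caller hands over all the ranges it needs up front, nearby ranges can be merged into fewer, larger requests. The sketch below assumes a hypothetical helper named `coalesce_ranges` with a hypothetical `max_gap` parameter; it is not taken from Hadoop, Spark, or any library API.

```rust
use std::ops::Range;

/// Merge byte ranges that overlap or sit within `max_gap` bytes of each other.
/// Hypothetical helper for illustration; empty ranges are dropped.
pub fn coalesce_ranges(ranges: &[Range<usize>], max_gap: usize) -> Vec<Range<usize>> {
    let mut sorted: Vec<Range<usize>> = ranges
        .iter()
        .filter(|r| r.start < r.end)
        .cloned()
        .collect();
    sorted.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<usize>> = Vec::new();
    for r in sorted {
        if let Some(prev) = merged.last_mut() {
            // Close enough to the previous request: extend it instead of
            // issuing a separate one.
            if r.start <= prev.end.saturating_add(max_gap) {
                prev.end = prev.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

Four requested ranges `0..10`, `12..20`, `100..110` and `5..8` with `max_gap = 4` collapse into two requests, `0..20` and `100..110`. The gap threshold trades a few wasted bytes for fewer round trips.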
Makes a lot of sense. I suppose it would be important to make such an fsspec caching layer an externally-implementable trait.
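One way an externally-implementable caching trait might look, as a minimal sketch only. The names `BlockCache`, `NoCache`, `FifoCache` and `fetch_block` are all hypothetical and do not come from fsspec, object_store, or this repo; the point is just that the read path depends on the trait, so users can plug in a policy that suits their workload.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical trait a caching layer could expose for users to implement.
pub trait BlockCache {
    /// Return the cached bytes for `block`, if present.
    fn get(&self, block: u64) -> Option<&[u8]>;
    /// Store freshly fetched bytes for `block`.
    fn put(&mut self, block: u64, bytes: Vec<u8>);
}

/// Policy 1: never cache anything.
pub struct NoCache;

impl BlockCache for NoCache {
    fn get(&self, _block: u64) -> Option<&[u8]> {
        None
    }
    fn put(&mut self, _block: u64, _bytes: Vec<u8>) {}
}

/// Policy 2: keep at most `capacity` blocks, evicting the oldest insertion first.
pub struct FifoCache {
    capacity: usize,
    order: VecDeque<u64>,
    blocks: HashMap<u64, Vec<u8>>,
}

impl FifoCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), blocks: HashMap::new() }
    }
}

impl BlockCache for FifoCache {
    fn get(&self, block: u64) -> Option<&[u8]> {
        self.blocks.get(&block).map(|b| b.as_slice())
    }
    fn put(&mut self, block: u64, bytes: Vec<u8>) {
        if self.capacity == 0 {
            return;
        }
        // Only track insertion order for blocks we have not seen before.
        if self.blocks.insert(block, bytes).is_none() {
            self.order.push_back(block);
            if self.order.len() > self.capacity {
                if let Some(old) = self.order.pop_front() {
                    self.blocks.remove(&old);
                }
            }
        }
    }
}

/// Read path written against the trait: consult the cache, else call `fetch`.
pub fn fetch_block<C: BlockCache>(
    cache: &mut C,
    block: u64,
    fetch: impl FnOnce(u64) -> Vec<u8>,
) -> Vec<u8> {
    if let Some(hit) = cache.get(block) {
        return hit.to_vec();
    }
    let bytes = fetch(block);
    cache.put(block, bytes.clone());
    bytes
}
```

Swapping `NoCache` for `FifoCache::new(2)` changes how often `fetch` runs without touching `fetch_block`, which is the property an external trait is meant to give.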
👋 Hi! I'm a fan of fsspec in Python and happy to see you working on it in Rust as well. I wanted to make sure you were aware of the object_store crate, because I see that as an existing implementation of fsspec in Rust, and you appear to be re-implementing fsspec from scratch here. Connecting that to Python might save some work?