Support loading strings with the pyarrow backend to use the string[pyarrow] pandas dtype #279

Closed
smcguire-cmu opened this issue Apr 17, 2024 · 0 comments · Fixed by #306

@smcguire-cmu
Contributor

Currently, pandas loads string columns as native Python strings, and dask then creates a slow task to convert them all to pyarrow strings. Pandas has recently added support for the pyarrow string dtype, and can load strings from parquet files directly into a pandas DataFrame with the string[pyarrow] dtype when dtype_backend="pyarrow" is passed to pd.read_parquet.

We already support passing kwargs through to this function, but when we generate the dask meta DataFrame from the parquet schema we don't use pyarrow string types, so the meta does not match the loaded data. The meta generation needs to be updated here, and we need to test that the new dtype works with the other from_delayed functions, for operations like crossmatching and joining where we generate the meta ourselves.

@nevencaplar nevencaplar moved this to Todo in HATS / LSDB Apr 18, 2024
@smcguire-cmu smcguire-cmu moved this from Todo to In Progress in HATS / LSDB Apr 19, 2024
@smcguire-cmu smcguire-cmu self-assigned this Apr 19, 2024
@smcguire-cmu smcguire-cmu moved this from In Progress to Todo in HATS / LSDB Apr 19, 2024
@smcguire-cmu smcguire-cmu moved this from Todo to In Progress in HATS / LSDB Apr 24, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in HATS / LSDB May 9, 2024
Labels
None yet
Projects
Status: Done