Support loading strings with the pyarrow backend to use the `string[pyarrow]` pandas dtype #279

smcguire-cmu · 2024-04-17T22:29:17Z

Currently, pandas loads string columns as native Python strings, and Dask then adds a slow task to convert them all to pyarrow strings. Pandas has recently introduced support for the pyarrow string dtype, and it can load strings from parquet files directly into a pandas DataFrame with that dtype when `dtype_backend="pyarrow"` is passed to the `pd.read_parquet` call.

We support passing kwargs to this function, but when we generate the Dask meta DataFrame from the parquet schema we don't use pyarrow string types, so we get a meta mismatch. This needs to be updated here, and we need to test that the new dtype works with the other from_delayed functions for operations such as crossmatching and joining, where we generate the meta.


delucchi-cmu added this to HATS / LSDB Apr 17, 2024

nevencaplar moved this to Todo in HATS / LSDB Apr 18, 2024

smcguire-cmu moved this from Todo to In Progress in HATS / LSDB Apr 19, 2024

smcguire-cmu self-assigned this Apr 19, 2024

smcguire-cmu moved this from In Progress to Todo in HATS / LSDB Apr 19, 2024

smcguire-cmu mentioned this issue Apr 23, 2024

Add support for Dask versions >=2024.3.0 with dask expressions #288

Merged

smcguire-cmu moved this from Todo to In Progress in HATS / LSDB Apr 24, 2024

camposandro mentioned this issue May 6, 2024

Load data using pyarrow types #306

Merged

camposandro closed this as completed in #306 May 9, 2024

github-project-automation bot moved this from In Progress to Done in HATS / LSDB May 9, 2024
