Load data using pyarrow types #306
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #306      +/-   ##
==========================================
+ Coverage   99.06%   99.08%   +0.02%
==========================================
  Files          41       41
  Lines        1277     1306      +29
==========================================
+ Hits         1265     1294      +29
  Misses         12       12

☔ View full report in Codecov by Sentry.
Does it also fix #89?
Yes, the string conversion tasks should no longer be required for catalogs loaded with pyarrow types. I created a notebook that loads a chunk of ZTF sources including the "band" column, which is of string type (we saw several bottlenecks with this column before). Using a single worker and the latest version of Dask (2024.5.0), these were the compute times I obtained:

Previously (w/ pyarrow string conversion): 181.35 s
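For context, a minimal sketch of how that measurement could be reproduced; the catalog path is hypothetical, and I'm assuming `read_hipscat` accepts a `columns` selection:

```python
import time

import lsdb
from dask.distributed import Client

# Single local worker, matching the setup described above.
client = Client(n_workers=1)

# Load only the string-typed "band" column from the (hypothetical) sources catalog.
catalog = lsdb.read_hipscat("ztf_sources", columns=["band"])

start = time.perf_counter()
catalog.compute()  # with pyarrow types, no string-conversion tasks are scheduled
print(f"compute time: {time.perf_counter() - start:.2f} s")
```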
Co-authored-by: Melissa DeLucchi <[email protected]>
Adds the `use_pyarrow_types` argument to the `read_hipscat` interface. This allows us to pass the `dtype_backend` to the calls that read the parquet leaf files, as well as the metadata, loading both the data and its schema with pyarrow types. Using the pyarrow backend (which is now the default!) we can load strings directly into a pandas DataFrame with the pyarrow string type, avoiding the creation of slow tasks to convert Python strings into pyarrow strings.
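A usage sketch (the catalog path is hypothetical; `use_pyarrow_types` is the new argument from this PR):

```python
import lsdb

# Pyarrow-backed types are now the default when loading a catalog.
catalog = lsdb.read_hipscat("ztf_sources")

# Opting out restores the previous, numpy-backed behavior.
catalog_numpy = lsdb.read_hipscat("ztf_sources", use_pyarrow_types=False)
```

Internally this maps to pandas' `dtype_backend="pyarrow"` option on the calls that read the parquet files.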
This change required me to also address #303, allowing catalogs to be created with pyarrow types in `from_dataframe`.

Closes #279 and #303.
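A small sketch of the `from_dataframe` path (the data is made up, and I'm assuming the same `use_pyarrow_types` flag is exposed there):

```python
import pandas as pd

import lsdb

# Toy point catalog; "band" is a string column that would previously
# have been loaded with numpy object dtype.
df = pd.DataFrame(
    {
        "ra": [10.5, 20.1, 30.7],
        "dec": [-5.2, 3.4, 12.9],
        "band": ["g", "r", "i"],
    }
)

# All columns, including the strings, are created with pyarrow types.
catalog = lsdb.from_dataframe(df, use_pyarrow_types=True)
```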