-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195
GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195
Conversation
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly |
Revision: 56b61f2 Submitted crossbow builds: ursacomputing/crossbow @ actions-c85b742ef7
|
python/pyarrow/tests/test_pandas.py
Outdated
e1 = pd.DataFrame( | ||
{'a': a_values}, | ||
index=pd.RangeIndex(0, 8, step=2, name='qux'), | ||
columns=pd.Index(['a'], dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the column type created with the dict argument differ from this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is specifically using old metadata that specifies the dtype of the columns is object dtype, and then pyarrow tries to restore it that way.
It's the question if we should do that though .. Because every file written from a pandas DataFrame before pandas 3.0 will have that, so maybe we should specifically ignore object dtype here if the inferred type is that it contains all strings, so users consistently get a columns Index object using str
dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm that's tricky but I think going with the str data type as you suggested is better; I would expect that is a better UX in over 99% of instances
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, changed this to ensure we actually use str
dtype columns Index object, even if the pandas metadata of the pyarrow table says that the original table was using object dtype.
This ensures that all existing files will use (with pandas>= 3) the default str dtype for the columns, but that also has the trade-off that if you explicitly want to use object dtype with strings, that this will no longer roundtrip in pandas->pyarrow/parquet->pandas)
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly |
Revision: 84b8234 Submitted crossbow builds: ursacomputing/crossbow @ actions-ac3103d3ba
|
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly |
Revision: e5db09f Submitted crossbow builds: ursacomputing/crossbow @ actions-3c389cd49e
|
…as-string-dtype
@github-actions crossbow submit -g python |
Revision: 940b64d Submitted crossbow builds: ursacomputing/crossbow @ actions-c96266afd8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @raulcd for missing your ping! I am also not very familiar with this part of the code so I have a few questions, although I generally trust Joris knows what he is doing here :-)
@@ -2523,7 +2523,8 @@ Status ConvertCategoricals(const PandasOptions& options, ChunkedArrayVector* arr | |||
} | |||
if (options.strings_to_categorical) { | |||
for (int i = 0; i < static_cast<int>(arrays->size()); i++) { | |||
if (is_base_binary_like((*arrays)[i]->type()->id())) { | |||
if (is_base_binary_like((*arrays)[i]->type()->id()) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The binary_view changes are tangential to pandas 3.x right? I wonder if they shouldn't be done in their own PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat tangential yes, as in that it is also a useful change to do regardless of the other changes here. I am also not entirely sure there is a very specific test for this.
Happy to move out to a separate PR, although I then would like both to get merged for 19.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I would prefer to separate out so we can analyze test coverage too. I suppose that would also be helpful if these changes end up going through to different releases (although that's not the aim)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. it turns out to not actually affect the tests here, so good to move that to a separate PR with actual test coverage: #45176
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've merged the related issue and marked as 19.0.0
if name is not None and not isinstance(name, str): | ||
if ( | ||
name is not None | ||
and not (isinstance(name, float) and np.isnan(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that np.nan is now a valid column name for the string data type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since this was essentially to support some missing values in the column names (but restricted to None), also allowing np.nan keeps somewhat the same behaviour when switching from object dtype to string dtype
…as-string-dtype
…44195) ### Rationale for this change With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype). This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled. ### What changes are included in this PR? - If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view) ### Are these changes tested? It is tested in the pandas-nightly crossbow build. There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879) ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side. * GitHub Issue: #43683 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…44195) ### Rationale for this change With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype). This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled. ### What changes are included in this PR? - If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view) ### Are these changes tested? It is tested in the pandas-nightly crossbow build. There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879) ### Are there any user-facing changes? **This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side. * GitHub Issue: #43683 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5181c24. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
With pandas' PDEP-14 proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).
This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (
pd.options.future.infer_string = True
). To prepare for that, we should start using that string dtype into_pandas()
conversions when that option is enabled.What changes are included in this PR?
to_pandas()
calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)Are these changes tested?
It is tested in the pandas-nightly crossbow build.
There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879)
Are there any user-facing changes?
This PR includes breaking changes to public APIs. Depending on the version of pandas,
to_pandas()
will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side.