GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195

Merged

Member

@jorisvandenbossche jorisvandenbossche commented Sep 23, 2024

Rationale for this change

With pandas' PDEP-14 proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).

This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (pd.options.future.infer_string = True). To prepare for that, we should start using that string dtype in to_pandas() conversions when that option is enabled.

What changes are included in this PR?

  • If pandas >= 3.0 is used or the pandas option is enabled, ensure that to_pandas() calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)

Are these changes tested?

It is tested in the pandas-nightly crossbow build.

One failure remains, caused by a bug on the pandas side (pandas-dev/pandas#59879).

Are there any user-facing changes?

This PR includes breaking changes to public APIs. Depending on the version of pandas, to_pandas() will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side.

@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

github-actions bot commented Nov 8, 2024

Revision: 56b61f2

Submitted crossbow builds: ursacomputing/crossbow @ actions-c85b742ef7

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

e1 = pd.DataFrame(
    {'a': a_values},
    index=pd.RangeIndex(0, 8, step=2, name='qux'),
    columns=pd.Index(['a'], dtype=object)

Contributor

Does the column type created with the dict argument differ from this?

Member Author

This test specifically uses old metadata that records the dtype of the columns as object dtype, and pyarrow then tries to restore it that way.

The question is whether we should do that, though. Every file written from a pandas DataFrame before pandas 3.0 will have that metadata, so maybe we should specifically ignore object dtype here when the inferred type shows the labels are all strings, so that users consistently get a columns Index object using str dtype.

Contributor

Hmm that's tricky but I think going with the str data type as you suggested is better; I would expect that is a better UX in over 99% of instances

Member Author

OK, I changed this to ensure we actually use a str dtype columns Index object, even if the pandas metadata of the pyarrow table says that the original table was using object dtype.

This ensures that all existing files will use the default str dtype for the columns (with pandas >= 3). The trade-off is that if you explicitly want to use object dtype with strings, this will no longer roundtrip (pandas -> pyarrow/parquet -> pandas).
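The decision described in this comment can be sketched in plain Python. This is an illustrative sketch only, not pyarrow's implementation: the function `resolve_columns_dtype`, its parameters, and the dtype names passed as strings are all hypothetical.

```python
from typing import Hashable, Optional, Sequence


def resolve_columns_dtype(
    stored_dtype: Optional[str],
    labels: Sequence[Hashable],
    string_dtype_enabled: bool,
) -> Optional[str]:
    """Pick the dtype name to use when rebuilding the columns Index."""
    all_strings = len(labels) > 0 and all(isinstance(label, str) for label in labels)
    if string_dtype_enabled and stored_dtype == "object" and all_strings:
        # Files written before pandas 3.0 record "object" for string labels,
        # so prefer the str dtype over what the metadata says.
        return "str"
    # In every other case, keep whatever the metadata recorded.
    return stored_dtype


restored = resolve_columns_dtype("object", ["a", "b"], string_dtype_enabled=True)
mixed = resolve_columns_dtype("object", ["a", 1], string_dtype_enabled=True)
legacy = resolve_columns_dtype("object", ["a", "b"], string_dtype_enabled=False)
```

The first call illustrates the trade-off mentioned above: an explicitly chosen object dtype with all-string labels is not preserved once the string dtype is in effect.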

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Nov 8, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Nov 13, 2024

Revision: 84b8234

Submitted crossbow builds: ursacomputing/crossbow @ actions-ac3103d3ba

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 13, 2024 09:43
@jorisvandenbossche
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly

Revision: e5db09f

Submitted crossbow builds: ursacomputing/crossbow @ actions-3c389cd49e

Task Status
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions

@github-actions github-actions bot removed the "awaiting change review" label on Nov 13, 2024
@github-actions github-actions bot added the "awaiting merge" and "awaiting changes" labels and removed the "awaiting merge" label on Dec 16, 2024
@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Jan 3, 2025

github-actions bot commented Jan 3, 2025

Revision: 940b64d

Submitted crossbow builds: ursacomputing/crossbow @ actions-c96266afd8

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Jan 3, 2025
Contributor

@WillAyd WillAyd left a comment

Sorry @raulcd for missing your ping! I am also not very familiar with this part of the code so I have a few questions, although I generally trust Joris knows what he is doing here :-)

@@ -2523,7 +2523,8 @@ Status ConvertCategoricals(const PandasOptions& options, ChunkedArrayVector* arr
   }
   if (options.strings_to_categorical) {
     for (int i = 0; i < static_cast<int>(arrays->size()); i++) {
-      if (is_base_binary_like((*arrays)[i]->type()->id())) {
+      if (is_base_binary_like((*arrays)[i]->type()->id()) ||

Contributor

The binary_view changes are tangential to pandas 3.x right? I wonder if they shouldn't be done in their own PR

Member Author

Somewhat tangential, yes, in the sense that it is a useful change regardless of the other changes here. I am also not entirely sure there is a very specific test for this.

Happy to move it out to a separate PR, although I would then like both to get merged for 19.0.

Contributor

Yeah, I would prefer to separate it out so we can analyze test coverage too. I suppose that would also be helpful if these changes end up going into different releases (although that's not the aim).

Member Author

OK, it turns out this does not actually affect the tests here, so it is good to move it to a separate PR with actual test coverage: #45176

Member

I've merged the related issue and marked as 19.0.0

-    if name is not None and not isinstance(name, str):
+    if (
+        name is not None
+        and not (isinstance(name, float) and np.isnan(name))

Contributor

Does this mean that np.nan is now a valid column name for the string data type?

Member Author

Yes. This check was essentially there to support missing values in the column names (but restricted to None), so also allowing np.nan keeps roughly the same behaviour when switching from object dtype to string dtype.
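The condition discussed here can be sketched with the standard library alone. The helper names `is_missing_name` and `needs_string_conversion` are hypothetical, and `math.isnan` stands in for the `np.isnan` call in the diff so the sketch has no dependencies; this is not the code from the PR.

```python
import math
from typing import Any


def is_missing_name(name: Any) -> bool:
    """Return True for None or a float NaN, the two 'missing' label values."""
    return name is None or (isinstance(name, float) and math.isnan(name))


def needs_string_conversion(name: Any) -> bool:
    """Return True when a column name is neither missing nor already a str."""
    return not is_missing_name(name) and not isinstance(name, str)


checks = {
    "int": needs_string_conversion(1),
    "str": needs_string_conversion("a"),
    "none": needs_string_conversion(None),
    "nan": needs_string_conversion(float("nan")),
}
```

With this shape, a NaN label is treated like None and left alone, while any other non-string label is still flagged.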

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Jan 9, 2025
@jorisvandenbossche jorisvandenbossche merged commit 5181c24 into apache:main Jan 9, 2025
36 checks passed
@jorisvandenbossche jorisvandenbossche removed the "awaiting change review" label on Jan 9, 2025
@jorisvandenbossche jorisvandenbossche deleted the gh-43683-pandas-string-dtype branch January 9, 2025 19:22
amoeba pushed a commit that referenced this pull request Jan 9, 2025
…44195)

### Rationale for this change

With pandas' [PDEP-14](https://pandas.pydata.org/pdeps/0014-string-dtype.html) proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).

This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (`pd.options.future.infer_string = True`). To prepare for that, we should start using that string dtype in `to_pandas()` conversions when that option is enabled.

### What changes are included in this PR?

- If pandas >= 3.0 is used or the pandas option is enabled, ensure that `to_pandas()` calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)

### Are these changes tested?

It is tested in the pandas-nightly crossbow build.

There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879)

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.** Depending on the version of pandas, `to_pandas()` will change to use pandas' string dtype instead of object dtype. This is a breaking user-facing change, but essentially just following the equivalent change in default dtype on the pandas side.

* GitHub Issue: #43683

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
amoeba pushed a commit that referenced this pull request Jan 11, 2025

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5181c24.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.
