Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: modin dtype interoperability #1692

Merged
merged 6 commits into from
Jan 1, 2025

Conversation

camriddell
Copy link
Contributor

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Narwhals failed to preserve pyarrow backed datatypes with Modin-native backed DataFrames. This issue was first noticed in the aforementioned PR and subsequent MRE

# modin.pandas Series with pyarrow backed dtype
>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0       1
1    <NA>
dtype: int64[pyarrow]

>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native() # loses pyarrow backend info
0    1.0
1    NaN
dtype: float64

With this PR, we appropriately preserve the dtype back-end

>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0       1
1    <NA>
dtype: int64[pyarrow]

>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native()
0     1.0
1    <NA>
dtype: double[pyarrow]

An edge case was encountered in narwhals/_pandas_like/group_by.py where .get_group(...) would raise a KeyError if the passed key contained float("nan") (as it does in the testing suite). This is likely due to some copying/reconstruction of the nan object which fails to reproduce its original hash since its hash is compute against the object id.

- preserve dtype backend when casting on modin natives
- groupby backends handle own __iter__
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no better feeling than seeing a failing CI job and then reading

[XPASS(strict)] 

😎 well done! this is awesome

indices = self._grouped.indices
if (
self._df._implementation is Implementation.PANDAS
and self._df._backend_version < (2, 2)
) or (
self._df._implementation is Implementation.CUDF
and self._df._backend_version < (2024, 12)
): # pragma: no cover
for key in indices:
yield (key, self._from_native_frame(self._grouped.get_group(key)))
else:
for key in indices:
key = tupleify(key) # noqa: PLW2901
yield (key, self._from_native_frame(self._grouped.get_group(key)))
for key, group in self._grouped:
yield (key, self._from_native_frame(group))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

legend, thanks so much @camriddell !

i've got something in progress for #1690 which should hopefully improve things for all these backends in tests

@MarcoGorelli MarcoGorelli merged commit 3124bf3 into narwhals-dev:main Jan 1, 2025
24 checks passed
@MarcoGorelli MarcoGorelli changed the title enh modin dtype interoperability fix: modin dtype interoperability Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants