fix: modin dtype interoperability #1692

camriddell · 2024-12-31T19:22:47Z

What type of PR is this? (check all applicable)

Related issues

Semi-related PR feat: add is_nan expression & series method #1625

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

Narwhals failed to preserve pyarrow-backed dtypes when the native object came from Modin. The issue was first noticed in the PR mentioned above and is reproduced by the following MRE:

# modin.pandas Series with pyarrow backed dtype
>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0       1
1    <NA>
dtype: int64[pyarrow]

>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native() # loses pyarrow backend info
0    1.0
1    NaN
dtype: float64

With this PR, the dtype backend is preserved:

>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0       1
1    <NA>
dtype: int64[pyarrow]

>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native()
0     1.0
1    <NA>
dtype: double[pyarrow]
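
To make the intent of the cast concrete, here is a minimal sketch of choosing the target dtype name from the starting dtype's backend. It is not Narwhals' actual implementation: `detect_backend`, `float64_target` and `FLOAT64_BY_BACKEND` are hypothetical names, and only the dtype strings themselves (`float64`, `Float64`, `double[pyarrow]`) are real pandas-style dtype names.

```python
# Hypothetical sketch: pick a Float64 target dtype that keeps the
# starting dtype's backend. Not the code used in narwhals itself.

FLOAT64_BY_BACKEND = {
    "pyarrow": "double[pyarrow]",   # as shown in the output above
    "numpy_nullable": "Float64",    # pandas nullable extension dtype
    "numpy": "float64",             # classic numpy-backed dtype
}


def detect_backend(dtype_name: str) -> str:
    """Classify a pandas-style dtype name by its backend (assumed helper)."""
    if dtype_name.endswith("[pyarrow]"):
        return "pyarrow"
    # Simplified heuristic: nullable numeric dtypes are capitalised
    # ("Int64", "Float64"). It misses "boolean" and "string".
    if dtype_name[:1].isupper():
        return "numpy_nullable"
    return "numpy"


def float64_target(starting_dtype_name: str) -> str:
    """Return the Float64 dtype name matching the starting backend."""
    return FLOAT64_BY_BACKEND[detect_backend(starting_dtype_name)]


pyarrow_target = float64_target("int64[pyarrow]")
numpy_target = float64_target("int64")
nullable_target = float64_target("Int64")
```

The point of the sketch is only that the starting dtype, not the library wrapping it, decides which flavour of Float64 the cast should produce.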

An edge case was encountered in narwhals/_pandas_like/group_by.py, where .get_group(...) would raise a KeyError if the passed key contained float("nan") (as it does in the test suite). This is likely because the nan object is copied or reconstructed somewhere along the way, so the new object no longer reproduces the original's hash, which is computed from the object's id.
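
The lookup failure can be illustrated without any dataframe library. The following is a sketch using a plain dict as a stand-in for the group mapping (the `groups` dict and its contents are made up for illustration). A key rebuilt with a fresh NaN object does not match the stored key, while iterating over the mapping never needs a lookup at all.

```python
# Stand-in for a group-key -> rows mapping; the data is illustrative only.
nan_key = float("nan")
groups = {(1.0, nan_key): ["row 0"], (2.0, 3.0): ["row 1"]}

# A key rebuilt from scratch holds a *different* NaN object.
rebuilt_key = (1.0, float("nan"))

# Same NaN object: the identity check succeeds, so the lookup works.
found_with_original = (1.0, nan_key) in groups

# Different NaN object: NaN != NaN and the identity differs, so the
# lookup fails (a plain groups[rebuilt_key] would raise KeyError).
found_with_rebuilt = rebuilt_key in groups

# Iterating yields each key together with its group, so no key ever
# has to be looked up again.
collected = [(key, rows) for key, rows in groups.items()]
```

This mirrors the approach described for this PR, where the grouped object is iterated directly rather than fetching each group by key.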

- preserve dtype backend when casting on modin natives
- groupby backends handle own __iter__

MarcoGorelli

no better feeling than seeing a failing CI job and then reading

[XPASS(strict)]

😎 well done! this is awesome

MarcoGorelli · 2024-12-31T19:26:40Z

narwhals/_pandas_like/group_by.py

-        indices = self._grouped.indices
-        if (
-            self._df._implementation is Implementation.PANDAS
-            and self._df._backend_version < (2, 2)
-        ) or (
-            self._df._implementation is Implementation.CUDF
-            and self._df._backend_version < (2024, 12)
-        ):  # pragma: no cover
-            for key in indices:
-                yield (key, self._from_native_frame(self._grouped.get_group(key)))
-        else:
-            for key in indices:
-                key = tupleify(key)  # noqa: PLW2901
-                yield (key, self._from_native_frame(self._grouped.get_group(key)))
+        for key, group in self._grouped:
+            yield (key, self._from_native_frame(group))


MarcoGorelli

legend, thanks so much @camriddell !

i've got something in progress for #1690 which should hopefully improve things for all these backends in tests

camriddell added 2 commits December 31, 2024 10:11

fix untangle modin numpy & pyarrow dtypes

04d3d66

- preserve dtype backend when casting on modin natives
- groupby backends handle own __iter__

enh split modin dtype backends in tests

70f0bff

MarcoGorelli reviewed Dec 31, 2024

camriddell added 4 commits December 31, 2024 12:53

fix modin tests to modin_pyarrow

f40255b

xfail pyarrow backed dates tzconvert on windows

cdacd8c

xfail groupby with nulls for pandas < 1.0

29a83ee

xfail pyarrow backed dates tzconvert on windows

0f52ab3

MarcoGorelli approved these changes Jan 1, 2025

MarcoGorelli merged commit 3124bf3 into narwhals-dev:main Jan 1, 2025
24 checks passed

MarcoGorelli added the fix label Jan 1, 2025

MarcoGorelli changed the title ~~enh modin dtype interoperability~~ fix: modin dtype interoperability Jan 1, 2025
