
feat: add is_nan expression & series method #1625

Merged
19 commits merged into narwhals-dev:main on Jan 3, 2025

Conversation

camriddell (Contributor)

  • add support for pandas, arrow, dask
  • add to docs
  • add tests

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

The is_nan implementation for pandas and dask currently returns True only when both of the following hold:

  1. The backend datatype is float64
  2. The value is not equal to itself (NaN is the only value for which this is true)

This leaves room for false negatives when a NaN value sits inside a Series with 'object' dtype, though I am uncertain whether we care about this edge case.
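A minimal sketch of the check described above (the helper name is illustrative, not the actual narwhals code): guard on the float64 dtype, then rely on NaN being the only value that is not equal to itself.

import pandas as pd

def sketch_is_nan(s: pd.Series) -> pd.Series:
    if s.dtype == "float64":
        return s != s  # NaN != NaN is True; every other value compares equal to itself
    # an object-dtype Series holding float("nan") slips through this guard,
    # which is the false-negative edge case mentioned above
    return pd.Series(False, index=s.index)

print(sketch_is_nan(pd.Series([1.0, float("nan")])).tolist())  # [False, True]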

Additionally, I was uncertain how to structure the tests with regard to NaN values inside nullable datatypes. I currently adjust the expected result dynamically, but was considering creating separate tests for data backed by a nullable datatype versus a non-nullable datatype, so some guidance here would be appreciated (see the sketch below for why the expectation differs by backend).
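For reference, a small sketch of why the expected result changes per backend (plain pandas/pyarrow calls, not the actual test fixtures): a numpy-backed float64 Series stores the missing value as NaN, while an Arrow-backed column keeps it as a proper null.

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

data = [2.0, None, float("nan")]

s_numpy = pd.Series(data)              # dtype float64: None becomes NaN
print((s_numpy != s_numpy).tolist())   # [False, True, True]

arr = pa.array(data)                   # Arrow double: null and NaN stay distinct
print(pc.is_nan(arr).to_pylist())      # [False, None, True]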

@MarcoGorelli (Member) left a comment

nice, thanks @camriddell !

in answer to your question on the issue, i think np.isnan might be fine too, I'd just check that it doesn't inadvertently end up copying data just to do the operation (whereas s != s shouldn't be too expensive)
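For what it's worth, a quick check of the equivalence on a plain float64 array (an illustration, not part of the PR): np.isnan and the self-comparison flag exactly the same positions.

import numpy as np

arr = np.array([1.0, np.nan, 3.0])
print(np.isnan(arr).tolist())  # [False, True, False]
print((arr != arr).tolist())   # [False, True, False]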

tests look good to me

docs/api-reference/expr.md
narwhals/expr.py

Let's define a dataframe-agnostic function:

>>> def my_library_agnostic_function(df_native: IntoFrameT) -> IntoFrameT:
Member

#1500 is still in progress, but as we're adding a new function, shall we follow the conventions there?

narwhals/expr.py

We define a dataframe-agnostic function:

>>> def my_library_agnostic_function(s_native: IntoSeriesT) -> IntoSeriesT:
Member

same

narwhals/series.py
camriddell (Contributor, Author) commented Dec 20, 2024

@MarcoGorelli I realized Polars raises an InvalidOperationError when is_nan is called on unsupported dtypes. Is this a behavior we should enforce on all other backends (pandas/dask) as well?

>>> import polars as pl
>>> pl.Series(["a", "b"]).is_nan()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/series/utils.py", line 106, in wrapper
    return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 9138, in select_seq
    return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: `is_nan` operation not supported for dtype `str`

Should we also raise an exception in these cases? Right now we are returning a Series of False values (though I do need to update this to respect existing Nulls).


Some additional testing code.

import polars as pl
from datetime import date

data = [[0], [1.1], ["a"], [date.today()], [[0]], [["a"]], [False]]
for d in data:
    s = pl.Series(d)
    try:
        s.is_nan()
    except pl.exceptions.InvalidOperationError:
        print(f"\N{cross mark} {s.dtype = }")
    else:
        print(f"\N{white heavy check mark} {s.dtype = }")

# ✅ s.dtype = Int64
# ✅ s.dtype = Float64
# ❌ s.dtype = String
# ❌ s.dtype = Date
# ❌ s.dtype = List(Int64)
# ❌ s.dtype = List(String)
# ❌ s.dtype = Boolean

PyArrow exhibits the same behavior

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> s = pa.array(["1", "2"])
>>> pc.is_nan(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pyarrow/compute.py", line 247, in wrapper
    return func.call(args, None, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'is_nan' has no kernel matching input types (string)

@MarcoGorelli (Member)

interesting, thanks - yeah that's probably a good idea. then we avoid the risk that people who are too used to pandas accidentally use is_nan in non-float columns in places where they really wanted is_null
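To illustrate the distinction being referenced here (a small Polars snippet, mirroring the outputs shown later in this thread): is_null flags the missing value, is_nan flags the NaN float, and the two are not interchangeable.

import polars as pl

s = pl.Series([1.0, None, float("nan")])
print(s.is_null().to_list())  # [False, True, False]  - only the missing value
print(s.is_nan().to_list())   # [False, None, True]   - only the NaN (the null propagates)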

camriddell (Contributor, Author) commented Dec 27, 2024

@MarcoGorelli thoughts on the apparent inconsistency in Polars' is_nan handling? In Polars, if the underlying dtype is Int64, then is_nan converts the null value to False. However, if the dtype is Float64, the null is preserved.

import pandas as pd
import polars as pl
import pyarrow as pa
import narwhals as nw
from narwhals.typing import IntoFrameT
data = {"a": [2, None, 4, 5, 6], "b": [2.0, None, float("nan"), 5.0, 6.0]}
df_pd = pd.DataFrame(data).astype({"a": "Int64"})
df_pl = pl.DataFrame(data)
df_pa = pa.table(data)

def agnostic_is_nan_columns(df_native: IntoFrameT) -> IntoFrameT:
    df = nw.from_native(df_native)
    return df.with_columns(
        a_is_nan=nw.col("a").is_nan(), b_is_nan=nw.col("b").is_nan()
    ).to_native()


print(agnostic_is_nan_columns(df_pd))
#       a    b  a_is_nan  b_is_nan
# 0     2  2.0     False     False
# 1  <NA>  NaN      <NA>      True
# 2     4  NaN     False      True
# 3     5  5.0     False     False
# 4     6  6.0     False     False

print(agnostic_is_nan_columns(df_pl))
# shape: (5, 4)
# ┌──────┬──────┬──────────┬──────────┐
# │ a    ┆ b    ┆ a_is_nan ┆ b_is_nan │
# │ ---  ┆ ---  ┆ ---      ┆ ---      │
# │ i64  ┆ f64  ┆ bool     ┆ bool     │
# ╞══════╪══════╪══════════╪══════════╡
# │ 2    ┆ 2.0  ┆ false    ┆ false    │
# │ null ┆ null ┆ false    ┆ null     │
# │ 4    ┆ NaN  ┆ false    ┆ true     │
# │ 5    ┆ 5.0  ┆ false    ┆ false    │
# │ 6    ┆ 6.0  ┆ false    ┆ false    │
# └──────┴──────┴──────────┴──────────┘

print(agnostic_is_nan_columns(df_pa))
# pyarrow.Table
# a: int64
# b: double
# a_is_nan: bool
# b_is_nan: bool
# ----
# a: [[2,null,4,5,6]]
# b: [[2,null,nan,5,6]]
# a_is_nan: [[false,null,false,false,false]]
# b_is_nan: [[false,null,true,false,false]]

MRE for the Polars example

>>> import polars as pl
>>> pl.__version__
'1.16.0'
>>> pl.DataFrame({"int": [None, 1], "float": [None, 1.0]}).with_columns(pl.all().is_nan().name.suffix("_is_nan"))
shape: (2, 4)
┌──────┬───────┬────────────┬──────────────┐
│ int  ┆ float ┆ int_is_nan ┆ float_is_nan │
│ ---  ┆ ---   ┆ ---        ┆ ---          │
│ i64  ┆ f64   ┆ bool       ┆ bool         │
╞══════╪═══════╪════════════╪══════════════╡
│ null ┆ null  ┆ false      ┆ null         │
│ 1    ┆ 1.0   ┆ false      ┆ false        │
└──────┴───────┴────────────┴──────────────┘

ser = self._native_series
if self.dtype.is_numeric():
    return self._from_native_series(ser != ser)  # noqa: PLR0124
msg = f"`is_nan` is not supported for dtype {self.dtype}"
Contributor Author

Should this error message be extended to include a suggestion?

"is_nan is not supported for dtype {self.dtype}, did you mean to use is_null?"

Member

lots of love for kind error messages 💚

@MarcoGorelli (Member)

thanks for spotting this! I think this looks like a bug in Polars, shall we report it there? I think either is_nan should only be supported on float columns (my preference), or the null preservation should be consistent

camriddell (Contributor, Author) commented Dec 30, 2024

Yep, it appears that this bug was noticed in this Polars PR: pola-rs/polars#15889, but not fully addressed until a broader change (apparently all float-specific operations failed to propagate nulls) was made last week for the 1.18 release: pola-rs/polars#20386.


I think only allowing .is_nan on float dtypes makes sense here; this will ensure operations are consistent across all backends without us creating a version-aware is_nan implementation for Polars to smooth out the API differences. I'll go ahead and implement this approach and hopefully wrap this up!
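A rough sketch of the float-only rule being settled on (plain pandas, with an assumed exception type; not the narwhals implementation): reject non-float dtypes up front and keep the self-inequality trick for the float case.

import pandas as pd

def strict_is_nan(s: pd.Series) -> pd.Series:
    if not pd.api.types.is_float_dtype(s.dtype):
        msg = f"`is_nan` is not supported for dtype {s.dtype}, did you mean `is_null`?"
        raise TypeError(msg)  # exception type chosen for this sketch only
    return s != s  # noqa: PLR0124 - NaN is the only value not equal to itself

print(strict_is_nan(pd.Series([1.0, float("nan")])).tolist())  # [False, True]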

@MarcoGorelli (Member) left a comment

thanks Cam, awesome stuff! just some minor things

narwhals/_dask/expr.py
narwhals/expr.py
narwhals/expr.py
narwhals/series.py
narwhals/series.py
camriddell (Contributor, Author) commented Dec 30, 2024

There was a failure for the Modin constructor, which seems to (improperly?) cast from a pyarrow-nullable backend to a numpy backend. Should this function

def get_dtype_backend(dtype: Any, implementation: Implementation) -> str:
    if implementation is Implementation.PANDAS:
        import pandas as pd

        if hasattr(pd, "ArrowDtype") and isinstance(dtype, pd.ArrowDtype):
            return "pyarrow-nullable"
        try:
            if isinstance(dtype, pd.core.dtypes.dtypes.BaseMaskedDtype):
                return "pandas-nullable"
        except AttributeError:  # pragma: no cover
            # defensive check for old pandas versions
            pass
        return "numpy"
    else:  # pragma: no cover
        return "numpy"
also have a clause for when the implementation is Modin? Perhaps something like this:

def get_dtype_backend(dtype: Any, implementation: Implementation) -> str:
    if any(implementation is impl for impl in [Implementation.PANDAS, Implementation.MODIN]):
        import pandas as pd

        if hasattr(pd, "ArrowDtype") and isinstance(dtype, pd.ArrowDtype):
            return "pyarrow-nullable"

    if implementation is Implementation.PANDAS:
        try:
            if isinstance(dtype, pd.core.dtypes.dtypes.BaseMaskedDtype):
                return "pandas-nullable"
        except AttributeError:  # pragma: no cover
            # defensive check for old pandas versions
            pass
        return "numpy"
    else:  # pragma: no cover
        return "numpy"

Should there be any cases here for CuDF as well?


relevant MRE

# modin.pandas Series with pyarrow backed dtype
>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0       1
1    <NA>
dtype: int64[pyarrow]

>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native() # loses pyarrow backend info
0    1.0
1    NaN
dtype: float64

>>> from narwhals._pandas_like.utils import get_dtype_backend
>>> get_dtype_backend(s.dtype, nw.Implementation.MODIN) 
'numpy'
>>> get_dtype_backend(s.dtype, nw.Implementation.PANDAS) # correctly returns pyarrow-nullable
'pyarrow-nullable'

- link out to pandas_like_concepts/null_handling
- remove stale raises
- fix returns for Series to be more specific
@MarcoGorelli (Member)

well-spotted, thanks! we should probably do the same for modin then

cuDF doesn't seem to have these different dtype backends (it's always nullable arrow dtypes)

@camriddell (Contributor, Author)

> well-spotted, thanks! we should probably do the same for modin then
>
> cuDF doesn't seem to have these different dtype backends (it's always nullable arrow dtypes)

Should the patch be a part of this PR or a new one since it is a semi-related fix?

@MarcoGorelli (Member)

either is fine, whichever workflow works best for you

camriddell mentioned this pull request Dec 31, 2024
@MarcoGorelli (Member) left a comment

thanks @camriddell !

narwhals/_dask/expr.py
MarcoGorelli added the enhancement (New feature or request) label on Jan 3, 2025
MarcoGorelli merged commit 1b3196b into narwhals-dev:main on Jan 3, 2025
24 checks passed
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues: feat: is_nan