-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add is_nan expression & series method #1625
Conversation
- add support for pandas, arrow, dask - add to docs - add tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks @camriddell !
in answer to your question on the issue, i think np.isnan
might be fine too, I'd just check that it doesn't inadvertently end up copying data just to do the operation (whereas s != s
shouldn't be too expensive)
tests look good to me
narwhals/expr.py
Outdated
|
||
Let's define a dataframe-agnostic function: | ||
|
||
>>> def my_library_agnostic_function(df_native: IntoFrameT) -> IntoFrameT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1500 is still in progress, but as we're adding a new function, shall we follow the conventions there?
narwhals/series.py
Outdated
|
||
We define a dataframe-agnostic function: | ||
|
||
>>> def my_library_agnostic_function(s_native: IntoSeriesT) -> IntoSeriesT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@MarcoGorelli I realized Polars raises an InvalidOperationError for checking >>> import polars as pl
>>> pl.Series(["a", "b"]).is_nan()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/series/utils.py", line 106, in wrapper
return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 9138, in select_seq
return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2029, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: `is_nan` operation not supported for dtype `str` Should we also raise an exception in these cases? Right now we are returning a Series of False values (though I do need to update this to respect existing Nulls). Some additional testing code. import polars as pl
from datetime import date
data = [[0], [1.1], ["a"], [date.today()], [[0]], [["a"]], [False]]
cant_check = []
for d in data:
s = pl.Series(d)
try:
s.is_nan()
except pl.exceptions.InvalidOperationError:
print(f"\N{cross mark} {s.dtype = }")
else:
print(f"\N{white heavy check mark} {s.dtype = }")
# β
s.dtype = Int64
# β
s.dtype = Float64
# β s.dtype = String
# β s.dtype = Date
# β s.dtype = List(Int64)
# β s.dtype = List(String)
# β s.dtype = Boolean PyArrow exhibits the same behavior >>> import pyarrow as pa
>>> s = pa.array(["1","2"])
>>> pc.is_nan(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pyarrow/compute.py", line 247, in wrapper
return func.call(args, None, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_compute.pyx", line 393, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function 'is_nan' has no kernel matching input types (string) |
interesting, thanks - yeah that's probably a good idea. then we avoid the risk that people who are too used to pandas accidentally use |
@MarcoGorelli thoughts on the apparent inconsistency in Polars import pandas as pd
import polars as pl
import pyarrow as pa
import narwhals as nw
from narwhals.typing import IntoFrameT
data = {"a": [2, None, 4, 5, 6], "b": [2.0, None, float("nan"), 5.0, 6.0]}
df_pd = pd.DataFrame(data).astype({"a": "Int64"})
df_pl = pl.DataFrame(data)
df_pa = pa.table(data)
def agnostic_is_nan_columns(df_native: IntoFrameT) -> IntoFrameT:
df = nw.from_native(df_native)
return df.with_columns(
a_is_nan=nw.col("a").is_nan(), b_is_nan=nw.col("b").is_nan()
).to_native()
print(agnostic_is_nan_columns(df_pd))
# a b a_is_nan b_is_nan
# 0 2 2.0 False False
# 1 <NA> NaN <NA> True
# 2 4 NaN False True
# 3 5 5.0 False False
# 4 6 6.0 False False
print(agnostic_is_nan_columns(df_pl))
# shape: (5, 4)
# ββββββββ¬βββββββ¬βββββββββββ¬βββββββββββ
# β a β b β a_is_nan β b_is_nan β
# β --- β --- β --- β --- β
# β i64 β f64 β bool β bool β
# ββββββββͺβββββββͺβββββββββββͺβββββββββββ‘
# β 2 β 2.0 β false β false β
# β null β null β false β null β
# β 4 β NaN β false β true β
# β 5 β 5.0 β false β false β
# β 6 β 6.0 β false β false β
# ββββββββ΄βββββββ΄βββββββββββ΄βββββββββββ
print(agnostic_is_nan_columns(df_pa))
# pyarrow.Table
# a: int64
# b: double
# a_is_nan: bool
# b_is_nan: bool
# ----
# a: [[2,null,4,5,6]]
# b: [[2,null,nan,5,6]]
# a_is_nan: [[false,null,false,false,false]]
# b_is_nan: [[false,null,true,false,false]] MRE for the Polars example >>> import polars as pl
>>> pl.__version__
'1.16.0'
>>> pl.DataFrame({"int": [None, 1], "float": [None, 1.0]}).with_columns(pl.all().is_nan().name.suffix("_is_nan"))
shape: (2, 4)
ββββββββ¬ββββββββ¬βββββββββββββ¬βββββββββββββββ
β int β float β int_is_nan β float_is_nan β
β --- β --- β --- β --- β
β i64 β f64 β bool β bool β
ββββββββͺββββββββͺβββββββββββββͺβββββββββββββββ‘
β null β null β false β null β
β 1 β 1.0 β false β false β
ββββββββ΄ββββββββ΄βββββββββββββ΄βββββββββββββββ |
narwhals/_pandas_like/series.py
Outdated
ser = self._native_series | ||
if self.dtype.is_numeric(): | ||
return self._from_native_series(ser != ser) # noqa: PLR0124 | ||
msg = f"`is_nan` is not supported for dtype {self.dtype}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this error message be extended to include a suggestion?
"is_nan
is not supported for dtype {self.dtype}, did you mean to use is_null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of love for kind error messages π
thanks for spotting this! I think this looks like a bug in Polars, shall we report it there? I think either |
Yep- it appears that this bug was noticed in this Polars PR: pola-rs/polars#15889 but not fully addressed until a broader change (apparently all float-specific operations failed to propagate nulls) was made last week for the 1.18 release: pola-rs/polars#20386. I think only allowing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Cam, awesome stuff! just some minor things
There was a failure for the Modin constructor, which seems to improperly? cast from a pyarrow-nullable backend to a numpy backend. Should narwhals/narwhals/_pandas_like/utils.py Lines 483 to 498 in 7c8c6fa
def get_dtype_backend(dtype: Any, implementation: Implementation) -> str:
if any(implementation is impl for impl in [Implementation.PANDAS, Implementation.MODIN]):
import pandas as pd
if hasattr(pd, "ArrowDtype") and isinstance(dtype, pd.ArrowDtype):
return "pyarrow-nullable"
if implementation is Implementation.PANDAS:
try:
if isinstance(dtype, pd.core.dtypes.dtypes.BaseMaskedDtype):
return "pandas-nullable"
except AttributeError: # pragma: no cover
# defensive check for old pandas versions
pass
return "numpy"
else: # pragma: no cover
return "numpy" Should there be any cases here for CuDF as well? relevant MRE # modin.pandas Series with pyarrow backed dtype
>>> import modin.pandas as mpd
>>> s = mpd.Series([1.0, None]).convert_dtypes(dtype_backend='pyarrow')
>>> s
0 1
1 <NA>
dtype: int64[pyarrow]
>>> import narwhals as nw
>>> ns = nw.from_native(s, series_only=True)
>>> ns.cast(nw.Float64).to_native() # loses pyarrow backend info
0 1.0
1 NaN
dtype: float64
>>> from narwhals._pandas_like.utils import get_dtype_backend
>>> get_dtype_backend(s.dtype, nw.Implementation.MODIN)
'numpy'
>>> get_dtype_backend(s.dtype, nw.Implementation.PANDAS) # correctly returns pyarrow-nullable
'pyarrow-nullable' |
- link out to pandas_like_concepts/null_handling - remove stale raises - fix returns for Series to be more specific
well-spotted, thanks! we should probably do the same for modin then cuDF doesn't seem to have these different dtype backends (it's always nullable arrow dtypes) |
Should the patch be a part of this PR or a new one since it is a semi-related fix? |
either is fine, whichever workflow works best for you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @camriddell !
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
The
is_nan
implementation for pandas and dask currently return True for the following conditions:This leaves space for false rejections when there is a NaN value inside of a Series with 'object' dtype, though I am uncertain if we care about this edgecase.
Additionally, I was uncertain on how to structure the tests with regards to having NaN values inside of nullable datatypes. I currently dynamically change the expected result but was considering creating separate tests for data that are backed by a nullable datatype vs a non-nullable datatype so some guidance here would be appreciated.