[BUG] using python -m cudf.pandas and calling hasattr converts NA to NaN #17666

Closed
MarcoGorelli opened this issue Jan 1, 2025 · 6 comments · Fixed by #17677
Labels: bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas)

@MarcoGorelli
Contributor

MarcoGorelli commented Jan 1, 2025

Describe the bug

Here's a complete reproduction: https://colab.research.google.com/drive/1E2bWuCZhuMK_t_aevsWQhbUysSF8hsHt?usp=sharing

src = """
import pandas as pd
df = pd.DataFrame({
    "a": ["a", "a", "b", "b", "b"],
    "b": [1, 2, None, 5, 3],
    "c": [5, 4, 3, 2, 1],
})
print(df)
print(hasattr(df, 'foobar'))
print(df)
"""
with open('f.py', 'w', encoding='utf-8') as fd:
  fd.write(src)

If I then run

%%bash
python -m cudf.pandas f.py

then I get

   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1
False
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1

Spotted in Narwhals

Expected behavior
Using hasattr should not change the contents of the DataFrame.

MarcoGorelli added the bug label on Jan 1, 2025
MarcoGorelli changed the title from "[BUG] using python -m cudf.pandas and using hasattr converts NA to NaN" to "[BUG] using python -m cudf.pandas and calling hasattr converts NA to NaN" on Jan 1, 2025
galipremsagar self-assigned this on Jan 2, 2025
@mroeschke
Contributor

Thanks for the report.

Possibly more simply: once the repr of the pandas DataFrame is accessed, it "overrides" the repr of the cudf DataFrame object:

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd
   ...: df = pd.DataFrame({
   ...:     "a": ["a", "a", "b", "b", "b"],
   ...:     "b": [1, 2, None, 5, 3],
   ...:     "c": [5, 4, 3, 2, 1],
   ...: })

In [3]: df
Out[3]: 
   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1

In [4]: df._fsproxy_slow
Out[4]: 
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1

In [5]: df
Out[5]: 
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1
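
The sequence above can be imitated with a toy proxy in plain Python. This is only a sketch of the suspected mechanism, not cudf.pandas' actual implementation: ToyProxy, its _fast and _slow fields, and the use of None to stand for <NA> are all hypothetical.

```python
import math


class ToyProxy:
    """Hypothetical stand-in for a fast/slow proxy; not the real cudf.pandas class."""

    def __init__(self, values):
        # "Fast" representation: None marks a null (<NA>).
        self._fast = list(values)
        self._slow = None  # built lazily on first fallback

    def _to_slow(self):
        # Assumed conversion: nulls become float NaN, as in a float64 column.
        if self._slow is None:
            self._slow = [math.nan if v is None else float(v) for v in self._fast]
            # The behaviour being modelled: the fast copy is rebuilt from the
            # slow one, so the null is replaced by a plain NaN value.
            self._fast = list(self._slow)
        return self._slow

    def __repr__(self):
        cells = ("<NA>" if v is None else str(v) for v in self._fast)
        return "[" + ", ".join(cells) + "]"


col = ToyProxy([1.0, 2.0, None, 5.0, 3.0])
before = repr(col)
col._to_slow()  # loosely analogous to touching df._fsproxy_slow
after = repr(col)
print(before)
print(after)
```

In this model the second repr shows nan where the first showed <NA>, which mirrors Out[3] versus Out[5].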

@MarcoGorelli
Contributor Author

I'll check tomorrow, but I think it was actually affecting results (e.g. df['b'].cumsum())

@MarcoGorelli
Contributor Author

Yup, here's a repro which better demonstrates the issue:

src = """
import pandas as pd
df = pd.DataFrame({
    "a": ["a", "a", "b", "b", "b"],
    "b": [1, 2, None, 5, 3],
    "c": [5, 4, 3, 2, 1],
})
print(df)
print(df.groupby('a')['b'].cumsum())
print(hasattr(df, 'foobar'))
print(df)
print(df.groupby('a')['b'].cumsum())
"""
with open('f.py', 'w', encoding='utf-8') as fd:
  fd.write(src)

The output is

   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1
0     1.0
1     3.0
2    <NA>
3     5.0
4     8.0
Name: b, dtype: float64
False
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1
0    1.0
1    3.0
2    NaN
3    NaN
4    NaN
Name: b, dtype: float64

So, we go from

0     1.0
1     3.0
2    <NA>
3     5.0
4     8.0
Name: b, dtype: float64

to

0    1.0
1    3.0
2    NaN
3    NaN
4    NaN
Name: b, dtype: float64
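
The difference between the two results is consistent with a null being skipped by the cumulative sum while a NaN value propagates through it. The plain-Python sketch below models that reading for the values in group "b"; toy_cumsum and the use of None for <NA> are illustrative assumptions, not cudf's implementation.

```python
import math


def toy_cumsum(values):
    """Cumulative sum where None (null) is skipped but NaN is an ordinary float."""
    out, total = [], 0.0
    for v in values:
        if v is None:
            out.append(None)  # null stays null; the running total is unchanged
        else:
            total += v  # NaN plus anything is NaN, so it sticks from here on
            out.append(total)
    return out


with_null = toy_cumsum([None, 5.0, 3.0])
with_nan = toy_cumsum([math.nan, 5.0, 3.0])
print(with_null)
print(with_nan)
```

With a null in the first position the later rows still accumulate (5.0, then 8.0); with a NaN value every later row becomes NaN.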

@mroeschke
Contributor

Ah OK, thanks for the additional repro.

I think when repr-ing with print, the cudf.pandas df undergoes a cudf to pandas to cudf round trip, and for column b we hit a dtype mismatch across that round trip, e.g.

In [2]: import cudf

In [3]: cudf.DataFrame([1, None]).dtypes
Out[3]: 
0    int64
dtype: object

In [4]: cudf.DataFrame.from_pandas(cudf.DataFrame([1, None]).to_pandas()).dtypes
Out[4]: 
0    float64
dtype: object

mroeschke added the cudf.pandas label on Jan 2, 2025
@galipremsagar
Contributor

This is because of the nan_as_null parameter, which is present during the round trip. I'm working on a fix.
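
For illustration, here is how a nan_as_null-style switch changes what comes back from a round trip. The helpers to_slow and from_slow are hypothetical and only model the idea in plain Python; they are not cudf functions, and the sketch makes no claim about which setting the affected code path actually used.

```python
import math


def to_slow(column):
    """Model of fast -> slow: nulls (None) become NaN in a float column."""
    return [math.nan if v is None else float(v) for v in column]


def from_slow(column, nan_as_null):
    """Model of slow -> fast: optionally turn NaN back into a null."""
    if nan_as_null:
        return [None if math.isnan(v) else v for v in column]
    return list(column)


original = [1.0, 2.0, None, 5.0, 3.0]
kept_null = from_slow(to_slow(original), nan_as_null=True)
lost_null = from_slow(to_slow(original), nan_as_null=False)
print(kept_null == original)
print(lost_null[2])
```

The point of the model is that the slow side cannot tell an original null from an original NaN, so whichever choice the flag makes is applied to both.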

@MarcoGorelli
Contributor Author

MarcoGorelli commented Jan 3, 2025

sure, thanks

No objections to fixing it like this, but I think falling back to pandas after a simple hasattr check is going to disappoint your users. hasattr checks are extremely common in all kinds of libraries (pymc, scikit-learn, ...).

Falling back to pandas just for the sake of raising an error message (which gets discarded by hasattr anyway) seems worse than raising an AttributeError with a message marginally different from pandas'.

EDIT: I've made a separate issue about this: #17678
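
One way to picture the alternative described above: the proxy decides from the wrapped types alone whether an attribute can exist, and raises AttributeError itself without converting any data. Everything in this sketch (LazyProxy, FastFrame, SlowFrame, the conversions counter) is hypothetical and is not how cudf.pandas is implemented.

```python
class FastFrame:
    def __init__(self, data):
        self.data = data

    def nrows(self):
        return len(self.data)


class SlowFrame:
    def __init__(self, data):
        self.data = data

    def nrows(self):
        return len(self.data)

    def slow_only(self):
        return "computed on the slow path"


class LazyProxy:
    """Hypothetical proxy that skips fallback for attributes that exist nowhere."""

    def __init__(self, data):
        self._fast = FastFrame(data)
        self.conversions = 0  # counts fast -> slow conversions

    def _fallback(self):
        self.conversions += 1
        return SlowFrame(self._fast.data)

    def __getattr__(self, name):
        # Only reached when normal lookup on the proxy itself fails.
        if hasattr(FastFrame, name):
            return getattr(self._fast, name)
        if hasattr(SlowFrame, name):
            return getattr(self._fallback(), name)
        # Known to neither type: fail here, with no conversion.
        raise AttributeError(f"{type(self).__name__!r} object has no attribute {name!r}")


proxy = LazyProxy([1, 2, 3])
missing = hasattr(proxy, "foobar")
conversions_after_hasattr = proxy.conversions
slow_result = proxy.slow_only()
conversions_after_slow = proxy.conversions
print(missing, conversions_after_hasattr, conversions_after_slow)
```

Checking the classes rather than the instances keeps the hasattr call cheap, at the cost of an error message that may differ from the one pandas would raise.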

rapids-bot bot pushed a commit that referenced this issue Jan 14, 2025
Fixes: #17666 

This PR ensures we convert all nulls to NaNs in float columns only in pandas compatibility mode.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Matthew Murray (https://github.com/Matt711)

URL: #17677
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jan 14, 2025