DOC (string dtype): updated 'Working with text data' for str dtype in pandas 3.0 #60535

Open: Uvi-12 wants to merge 7 commits into base: main

Contributor

@Uvi-12 Uvi-12 commented Dec 10, 2024

Closes #60348

This PR updates the "Working with Text Data" page in the pandas documentation to reflect the change in pandas 3.0 where "str" dtype is now the default.

Contributor Author

Uvi-12 commented Dec 10, 2024

pre-commit.ci autofix

Contributor Author

Uvi-12 commented Dec 17, 2024

Hi @mroeschke, I've updated the 'text.rst' file so that 'Working with text data' covers the str dtype in pandas 3.0.

I noticed that after updating the branch, a commit touching expressions.py was added, although I have made no changes to that file. Could you please help me understand the changes and guide me through fixing this?

Member

@rhshadrach rhshadrach left a comment

Thanks for the PR! I think it'd be good to also link to the PDEP for history - perhaps at the top?

https://pandas.pydata.org/pdeps/0014-string-dtype.html

@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

1. ``object`` -dtype NumPy array.
2. :class:`StringDtype` extension type.
3. ``str`` -dtype (default from pandas 3.0).
Member

Line 14 needs to be updated. Also, I think `string` should be mentioned here too.

Member

The way this is written, it sounds like str is entirely separate from StringDtype, but that isn't the case.

import numpy as np
import pandas as pd

pd.set_option("future.infer_string", True)
ser1 = pd.Series(list("xyz"), dtype="str")
# The "pyarrow" storage requires pyarrow to be installed.
ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
print(ser1.dtype == ser2.dtype)
# True

Maybe just mention that "str", when future.infer_string is set to True, is an alias for pd.StringDtype("pyarrow", np.nan) or pd.StringDtype("python", np.nan), depending on whether pyarrow is installed.


-We recommend using :class:`StringDtype` to store text data.
+We recommend using the ``str`` dtype or :class:`StringDtype` to store text data.
Member

Or string. Maybe reword to something like

We recommend __not__ using ``object`` dtype to store text data.

Comment on lines 38 to 39
Use the nullable :class:`StringDtype` (``"string"``) when handling NA values in your string data. It offers
additional flexibility for missing values while maintaining compatibility with pandas' nullable types.
Member

str also handles NA values; the difference is np.nan vs pd.NA. I think it'd be good to clarify that here.
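For reference, a minimal sketch of the two missing-value sentinels. It assumes a pandas version in which `StringDtype` accepts an `na_value` keyword; the `try`/`except` is only there so the snippet still runs on releases that do not, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# "string" is the NA-backed StringDtype: a missing entry comes back as pd.NA.
ser_na = pd.Series(["a", None, "c"], dtype="string")
print(ser_na[1] is pd.NA)  # True

# Assumption: the na_value keyword is only available in newer pandas releases,
# so building the NaN-backed variant is guarded.
try:
    nan_dtype = pd.StringDtype(storage="python", na_value=np.nan)
except TypeError:
    nan_dtype = None

if nan_dtype is not None:
    ser_nan = pd.Series(["a", None, "c"], dtype=nan_dtype)
    # The missing entry of the NaN-backed variant is printed for comparison.
    print(ser_nan[1])

# Both variants flag the same positions as missing.
print(ser_na.isna().tolist())  # [False, True, False]
```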


.. _text.differences:

Behavior differences
^^^^^^^^^^^^^^^^^^^^

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype
+These are places where the behavior of ``StringDtype`` or ``str`` objects differ from
Member

Needs a mention of string too.

Member

Actually - can just leave this as StringDtype - no need to mention the aliases.

Comment on lines 160 to 163
3. In comparison operations, :class:`arrays.StringArray`, ``Series`` backed
by a ``StringArray``, and ``str`` dtype will return an object with :class:`BooleanDtype`,
rather than a ``bool`` dtype object. Missing values in these types will propagate
in comparison operations, rather than always comparing unequal like :attr:`numpy.nan`.
Member

I don't think this is true for str.
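As a rough illustration of what the quoted paragraph describes, the sketch below compares an ``object`` Series with one using the NA-backed ``"string"`` dtype. The NaN-backed ``str`` variant is deliberately left out, since its behavior is exactly what is in question here.

```python
import pandas as pd

obj = pd.Series(["a", None, "c"], dtype=object)
nullable = pd.Series(["a", None, "c"], dtype="string")

obj_eq = obj == "a"
nullable_eq = nullable == "a"

# object dtype: plain bool result, the missing entry simply compares unequal.
print(obj_eq.dtype)     # bool
print(obj_eq.tolist())  # [True, False, False]

# "string" dtype: BooleanDtype result, the missing entry propagates as <NA>.
print(nullable_eq.dtype)            # boolean
print(nullable_eq.isna().tolist())  # [False, True, False]
```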

``Series``.

.. _text.warn_types:

.. warning::

-The type of the Series is inferred and is one among the allowed types (i.e. strings).
+The type of the Series is inferred as ``str`` or ``string`` depending on the context.
Member

In what context do we infer string?

Contributor Author

The string dtype is inferred when the data includes pd.NA or other nullable types, ensuring compatibility with pandas' nullable ecosystem. It is also inferred when explicitly specified by the user with dtype="string". Otherwise, str is typically inferred for text data. Please correct me if I am wrong.

Member

> The string dtype is inferred when the data includes pd.NA or other nullable types, ensuring compatibility with pandas' nullable ecosystem.

pandas will not infer string here. This is consistent with other nullable dtypes.

import pandas as pd

pd.set_option("future.infer_string", True)

ser = pd.Series(["a", pd.NA, "c"])
print(ser.dtype)
# str

ser = pd.Series([1, pd.NA, 3])
print(ser.dtype)
# object

> It is also inferred when explicitly specified by the user with dtype="string".

This is not inference - inference, by definition, is the behavior when it is not explicitly specified.
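To make the distinction concrete, here is a small sketch that contrasts inference (no `dtype` argument) with an explicit request. It does not touch `future.infer_string`, so it only relies on behavior already described in this thread.

```python
import pandas as pd

# Inference: no dtype is given, so pandas picks one itself.
# Integers mixed with pd.NA are not inferred as a nullable dtype.
inferred = pd.Series([1, pd.NA, 3])
print(inferred.dtype)  # object

# Explicit: the caller names the dtype, so nothing is inferred.
explicit_int = pd.Series([1, pd.NA, 3], dtype="Int64")
explicit_str = pd.Series(["a", pd.NA, "c"], dtype="string")
print(explicit_int.dtype)  # Int64
print(explicit_str.dtype)  # string
```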

Contributor Author

@Uvi-12 Uvi-12 Jan 1, 2025

Thank you for the clarification. Shall I change it back to the original, or update it with an explanation?

Member

I think:

When the option future.infer_string is set to True, the type of the Series is inferred as str.

@@ -396,7 +431,7 @@ Missing values on either side will result in missing values in the result as wel
Concatenating a Series and something array-like into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the lengths of the calling ``Series`` (or ``Index``).
+The parameter ``others`` can also be two-dimensional. In this case, the number or rows must match the length of the calling ``Series`` (or ``Index``).
Member

"the number or rows" should be "the number of rows" instead.
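For readers following along, a minimal sketch of the two-dimensional case this sentence documents. The column names `x` and `y` are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

# A two-dimensional ``others`` with one row per element of the calling Series.
others = pd.DataFrame({"x": ["1", "2", "3"], "y": ["4", "5", "6"]})

# Each row of ``others`` is appended, column by column, to the matching element.
result = s.str.cat(others)
print(result.tolist())  # ['a14', 'b25', 'c36']
```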

Comment on lines 260 to 265

    return (
        _where(cond, left_op, right_op)
        if use_numexpr
        else _where_standard(cond, left_op, right_op)
    )
Member

This unintentional change was due to merging main, can you revert.

@rhshadrach added the Docs and Strings (String extension data type and string data) labels on Dec 30, 2024
@Uvi-12 requested a review from rhshadrach on January 1, 2025 08:09
Contributor Author

Uvi-12 commented Jan 1, 2025

I have made changes according to the review, but somehow I can't revert the changes in expressions.py. Can you please help me with it?

Member

rhshadrach commented Jan 1, 2025

> I have made changes according to the review, but somehow I can't revert the changes in expressions.py. Can you please help me with it?

You can open the file in an editor and manually undo the changes. Locally, you can run

git diff upstream/main pandas/core/computation/expressions.py

and keep making modifications until that reports no differences.

If you'd like, I can push a commit undoing the changes there.

Contributor Author

Uvi-12 commented Jan 1, 2025

> If you'd like, I can push a commit undoing the changes there.

That would be very helpful, thank you.

I have made the changes as per the review, please let me know if there are any other modifications required in the documentation.

Contributor Author

Uvi-12 commented Jan 9, 2025

@rhshadrach, is everything alright, or are modifications needed?

Member

@rhshadrach rhshadrach left a comment

There are several mentions of `pd.StringDtype` or `str` as if these are distinct entities. As mentioned below, they are not. I think we should make it clear that str is an alias, and just mention pd.StringDtype from there on. But open to other approaches too.

@@ -15,8 +15,9 @@ There are two ways to store text data in pandas:

1. ``object`` -dtype NumPy array.
2. :class:`StringDtype` extension type.
3. ``str`` -dtype (default from pandas 3.0).
Member

The way this is written, it sounds like str is entirely separate from StringDtype, but that isn't the case.

import numpy as np
import pandas as pd

pd.set_option("future.infer_string", True)
ser1 = pd.Series(list("xyz"), dtype="str")
# The "pyarrow" storage requires pyarrow to be installed.
ser2 = pd.Series(list("xyz"), dtype=pd.StringDtype("pyarrow", np.nan))
print(ser1.dtype == ser2.dtype)
# True

Maybe just mention that "str", when future.infer_string is set to True, is an alias for pd.StringDtype("pyarrow", np.nan) or pd.StringDtype("python", np.nan), depending on whether pyarrow is installed.

when creating new data structures.

Use the nullable :class:`StringDtype` (``"string"``) or ``str`` dtype when handling NA values in your string data.
Note that ``StringDtype`` uses ``pd.NA`` for missing values, whereas ``str`` dtype uses ``np.NaN``. ``StringDtype``
Member

This is incorrect, since StringDtype accepts an na_value argument to choose between pd.NA and np.nan. np.NaN has been removed as of NumPy 2.0, so shouldn't be used.

Comment on lines +67 to 69
pd.Series(["a", "b", "c"], dtype="str")
pd.Series(["a", "b", "c"], dtype="string")
pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Member

I think this should also show passing arguments to StringDtype.
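Something along these lines might work as a starting point. It is only a sketch: it sticks to the ``"python"`` storage so it does not depend on pyarrow, and it assumes the `na_value` keyword may be missing on older pandas releases, hence the guard.

```python
import numpy as np
import pandas as pd

# Passing the storage argument explicitly.
python_backed = pd.Series(["a", "b", "c"], dtype=pd.StringDtype("python"))
print(python_backed.dtype.storage)  # python

# Passing na_value as well. Assumption: older pandas versions do not accept
# this keyword and raise TypeError, so the call is guarded.
try:
    nan_dtype = pd.StringDtype(storage="python", na_value=np.nan)
    nan_backed = pd.Series(["a", None, "c"], dtype=nan_dtype)
    print(nan_backed.isna().tolist())
except TypeError:
    print("this pandas version does not accept na_value")
```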


.. _text.differences:

Behavior differences
^^^^^^^^^^^^^^^^^^^^

-These are places where the behavior of ``StringDtype`` objects differ from
-``object`` dtype
+These are places where the behavior of ``StringDtype`` or ``str`` objects differ from
Member

Actually - can just leave this as StringDtype - no need to mention the aliases.

``Series``.

.. _text.warn_types:

.. warning::

-The type of the Series is inferred and is one among the allowed types (i.e. strings).
+The type of the Series is inferred as ``str`` or ``string`` depending on the context.
Member

I think:

When the option future.infer_string is set to True, the type of the Series is inferred as str.

Labels
Docs Strings String extension data type and string data
Development

Successfully merging this pull request may close these issues.

DOC (string dtype): update user guide page "Working with text data"
2 participants