Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC, feat: infer datetime format for pyarrow backend #1195

Merged
merged 10 commits into from
Oct 29, 2024

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Oct 16, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

As discussed in the comment of the PR tagged in the issue, the implementation is inspired by and adjusted from the following repository: datetime-format.

I am opening this as draft and looking for help for:

  • Enlarge the test suite with multiple formats, if it would be possible to do this dynamically (maybe with hypothesis it would be great - I never used it though)
  • Add other common formats:
    • allow for AM/PM
    • Are there other date separators?
    • Are there other time separators?
    • other?

For the two latter questions, sometimes at work we use no separators e.g. YYYYmmddHHMMSS

PS: CI Failure on doctest is unrelated

@FBruzzesi FBruzzesi added enhancement New feature or request high priority Your PR will be reviewed very quickly if you address this labels Oct 16, 2024
@MarcoGorelli
Copy link
Member

ooh, nice!

I think this might already be a good start? if we can at least auto-detect iso8601-like formats, that's already an improvement

@FBruzzesi FBruzzesi marked this pull request as ready for review October 29, 2024 09:15
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for doing this!

If I understand correctly, this looks a bit too expensive - we're checking whether each element matches a given format, trying this for multiple formats, and then picking one

This is more expensive that what pandas does, which is just infer the format from the first non-null element, and then use that for all elements

Could we just do that instead?

@MarcoGorelli
Copy link
Member

just tried timing this, and noticed a couple of things:

  • '%Y-%m-%dT%H:%M' doesn't get inferred. I think we should auto-infer this one too
  • the overhead is quite significant:
In [27]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h')).dt.strftime('%Y-%m-%dT%H:%M:%S')

In [28]: s_pa = pa.chunked_array([s])

In [29]: %time nw.from_native(s_pa, series_only=True).str.to_datetime(format='%Y-%m-%dT%H:%M:%S').to_native()
CPU times: user 5.89 ms, sys: 0 ns, total: 5.89 ms
Wall time: 5.94 ms
Out[29]: 
<pyarrow.lib.ChunkedArray object at 0x7fe623eafe80>
[
  [
    2000-01-01 00:00:00.000000,
    2000-01-01 01:00:00.000000,
    2000-01-01 02:00:00.000000,
    2000-01-01 03:00:00.000000,
    2000-01-01 04:00:00.000000,
    ...
    2011-05-29 11:00:00.000000,
    2011-05-29 12:00:00.000000,
    2011-05-29 13:00:00.000000,
    2011-05-29 14:00:00.000000,
    2011-05-29 15:00:00.000000
  ]
]

In [30]: %time nw.from_native(s_pa, series_only=True).str.to_datetime().to_native()
CPU times: user 111 ms, sys: 9.42 ms, total: 121 ms
Wall time: 130 ms
Out[30]: 
<pyarrow.lib.ChunkedArray object at 0x7fe62865ca00>
[
  [
    2000-01-01 00:00:00.000000,
    2000-01-01 01:00:00.000000,
    2000-01-01 02:00:00.000000,
    2000-01-01 03:00:00.000000,
    2000-01-01 04:00:00.000000,
    ...
    2011-05-29 11:00:00.000000,
    2011-05-29 12:00:00.000000,
    2011-05-29 13:00:00.000000,
    2011-05-29 14:00:00.000000,
    2011-05-29 15:00:00.000000
  ]
]

@FBruzzesi
Copy link
Member Author

If I understand correctly, this looks a bit too expensive - we're checking whether each element matches a given format, trying this for multiple formats, and then picking one

This is more expensive that what pandas does, which is just infer the format from the first non-null element, and then use that for all elements

Could we just do that instead?

If I take the slice of the first 10 elements, then performances seem to not be impacted - clearly 10 is arbitrary number I tried for a test.

'%Y-%m-%dT%H:%M' doesn't get inferred. I think we should auto-infer this one too

Sure we can increase the supported formats, I just wanted to put the PR out to make sure that the approach is reasonable

@MarcoGorelli
Copy link
Member

sure, first 10 seems fine, thanks!

and we can leave ''%Y-%m-%dT%H:%M'' for later

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonderful, thanks @FBruzzesi ! looks good on green

let's make a release and update the plotly PR?

@FBruzzesi
Copy link
Member Author

Something is off with some tests, I will take a look

@FBruzzesi
Copy link
Member Author

@MarcoGorelli should be good to go now :)

@MarcoGorelli
Copy link
Member

amazing, love this

@MarcoGorelli MarcoGorelli merged commit f349cb2 into main Oct 29, 2024
25 checks passed
@FBruzzesi FBruzzesi deleted the feat/pyarrow-to-datetime-infer branch October 29, 2024 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request high priority Your PR will be reviewed very quickly if you address this
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: Infer datetime format for pyarrow backend in to_datetime(format=None) case
3 participants