RFC, feat: infer datetime format for pyarrow backend #1195

FBruzzesi · 2024-10-16T20:30:34Z

What type of PR is this? (check all applicable)

Related issues

Closes [Enh]: Infer datetime format for pyarrow backend in to_datetime(format=None) case #1151

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

As discussed in a comment on the PR linked in the issue, the implementation is inspired by and adapted from the datetime-format repository.

I am opening this as a draft and am looking for help with:

- Enlarging the test suite to cover multiple formats. It would be great if this could be done dynamically, maybe with hypothesis (I have never used it, though).
- Adding other common formats:
  - allow for AM/PM
  - are there other date separators?
  - are there other time separators?
  - other?

On the two separator questions: at work we sometimes use no separators at all, e.g. YYYYmmddHHMMSS
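To make the separator questions concrete, one option is to build candidate formats as the product of a few separator choices, with the empty string covering the separator-free layout. This is only a sketch in plain Python: the separator lists and the `candidate_formats` helper are assumptions for illustration, not what this PR implements.

```python
from datetime import datetime
from itertools import product

# Hypothetical separator choices; the empty string covers the
# separator-free layout mentioned above (YYYYmmddHHMMSS).
DATE_SEPARATORS = ("-", "/", ".", "")
DATETIME_JOINERS = ("T", " ", "")
TIME_SEPARATORS = (":", "")


def candidate_formats() -> list:
    """Build one strptime format per separator combination."""
    return [
        f"%Y{d}%m{d}%d{j}%H{t}%M{t}%S"
        for d, j, t in product(DATE_SEPARATORS, DATETIME_JOINERS, TIME_SEPARATORS)
    ]


# Round-trip check: format a known datetime, then parse it back.
sample = datetime(2024, 10, 16, 20, 30, 34)
for fmt in candidate_formats():
    assert datetime.strptime(sample.strftime(fmt), fmt) == sample
```

The same loop could be the basis for a dynamic test: generate datetimes, render them with each candidate, and check that inference recovers the format.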

PS: the CI failure on doctest is unrelated

MarcoGorelli · 2024-10-17T17:07:57Z

ooh, nice!

I think this might already be a good start? if we can at least auto-detect iso8601-like formats, that's already an improvement

narwhals/_arrow/utils.py

MarcoGorelli

thanks for doing this!

If I understand correctly, this looks a bit too expensive - we're checking whether each element matches a given format, trying this for multiple formats, and then picking one

This is more expensive than what pandas does, which is just to infer the format from the first non-null element, and then use that for all elements

Could we just do that instead?
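A rough sketch of the "first non-null element" idea, in plain Python rather than pyarrow so it stays self-contained. The candidate list and the `infer_format` / `to_datetime` helpers are illustrative assumptions, not the code under review:

```python
from datetime import datetime
from typing import Optional, Sequence

# Illustrative candidates only; the real list in the PR may differ.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
)


def infer_format(values: Sequence[Optional[str]]) -> str:
    """Infer a strptime format from the first non-null value only."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        raise ValueError("cannot infer a format from an all-null column")
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(first, fmt)
        except ValueError:
            continue
        return fmt
    raise ValueError(f"no known format matches {first!r}")


def to_datetime(values: Sequence[Optional[str]]) -> list:
    """Parse every value with the single inferred format, keeping nulls."""
    fmt = infer_format(values)
    return [None if v is None else datetime.strptime(v, fmt) for v in values]
```

The inference cost here is bounded by the number of candidate formats, not by the length of the column; only the final parse touches every element.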

MarcoGorelli · 2024-10-29T10:11:54Z

just tried timing this, and noticed a couple of things:

- '%Y-%m-%dT%H:%M' doesn't get inferred. I think we should auto-infer this one too
- the overhead is quite significant:

In [27]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h')).dt.strftime('%Y-%m-%dT%H:%M:%S')

In [28]: s_pa = pa.chunked_array([s])

In [29]: %time nw.from_native(s_pa, series_only=True).str.to_datetime(format='%Y-%m-%dT%H:%M:%S').to_native()
CPU times: user 5.89 ms, sys: 0 ns, total: 5.89 ms
Wall time: 5.94 ms
Out[29]: 
<pyarrow.lib.ChunkedArray object at 0x7fe623eafe80>
[
  [
    2000-01-01 00:00:00.000000,
    2000-01-01 01:00:00.000000,
    2000-01-01 02:00:00.000000,
    2000-01-01 03:00:00.000000,
    2000-01-01 04:00:00.000000,
    ...
    2011-05-29 11:00:00.000000,
    2011-05-29 12:00:00.000000,
    2011-05-29 13:00:00.000000,
    2011-05-29 14:00:00.000000,
    2011-05-29 15:00:00.000000
  ]
]

In [30]: %time nw.from_native(s_pa, series_only=True).str.to_datetime().to_native()
CPU times: user 111 ms, sys: 9.42 ms, total: 121 ms
Wall time: 130 ms
Out[30]: 
<pyarrow.lib.ChunkedArray object at 0x7fe62865ca00>
[
  [
    2000-01-01 00:00:00.000000,
    2000-01-01 01:00:00.000000,
    2000-01-01 02:00:00.000000,
    2000-01-01 03:00:00.000000,
    2000-01-01 04:00:00.000000,
    ...
    2011-05-29 11:00:00.000000,
    2011-05-29 12:00:00.000000,
    2011-05-29 13:00:00.000000,
    2011-05-29 14:00:00.000000,
    2011-05-29 15:00:00.000000
  ]
]

FBruzzesi · 2024-10-29T10:40:03Z

> If I understand correctly, this looks a bit too expensive - we're checking whether each element matches a given format, trying this for multiple formats, and then picking one
>
> This is more expensive than what pandas does, which is just to infer the format from the first non-null element, and then use that for all elements
>
> Could we just do that instead?

If I take a slice of the first 10 elements, performance does not seem to be impacted. Clearly, 10 is an arbitrary number that I picked for a test.
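Roughly, the sampled variant could look like the sketch below: take the first `n` non-null values and keep the first candidate format that parses all of them. It is plain Python for illustration, and the candidate list and the `infer_format_from_sample` name are assumptions, not the actual pyarrow code:

```python
from datetime import datetime
from itertools import islice
from typing import Iterable, Optional

# Illustrative candidates; "%m-%d-%Y" is listed before "%d-%m-%Y" on
# purpose, to show how a larger sample resolves the ambiguity.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%m-%d-%Y",
    "%d-%m-%Y",
)


def _matches(value: str, fmt: str) -> bool:
    """Return True if `value` parses with `fmt`."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def infer_format_from_sample(values: Iterable[Optional[str]], n: int = 10) -> str:
    """Infer a format from the first `n` non-null values."""
    sample = list(islice((v for v in values if v is not None), n))
    if not sample:
        raise ValueError("cannot infer a format from an all-null column")
    for fmt in CANDIDATE_FORMATS:
        if all(_matches(v, fmt) for v in sample):
            return fmt
    raise ValueError("no known format matches the sampled values")
```

With `["01-02-2024", "13-02-2024"]`, the first value alone would match `%m-%d-%Y`, but the second rules it out, so the sample settles on `%d-%m-%Y`. The work is still bounded by `n` times the number of candidates, independent of the column length.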

> '%Y-%m-%dT%H:%M' doesn't get inferred. I think we should auto-infer this one too

Sure, we can increase the supported formats; I just wanted to put the PR out to make sure that the approach is reasonable

MarcoGorelli · 2024-10-29T10:42:02Z

sure, first 10 seems fine, thanks!

and we can leave '%Y-%m-%dT%H:%M' for later
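For reference, one possible way to cover the minutes-only case later is to make the seconds group optional in a regex and rebuild the format from what matched. A rough sketch only; `ISO_LIKE` and `infer_iso_like_format` are hypothetical names, not part of narwhals:

```python
import re

# Hypothetical pattern: ISO-like date, "T" or space as the joiner,
# hours and minutes, and an optional seconds group.
ISO_LIKE = re.compile(
    r"\d{4}-\d{2}-\d{2}(?P<sep>[T ])\d{2}:\d{2}(?P<seconds>:\d{2})?"
)


def infer_iso_like_format(value: str) -> str:
    """Return a strptime format for an ISO-like string, seconds optional."""
    match = ISO_LIKE.fullmatch(value)
    if match is None:
        raise ValueError(f"{value!r} is not an ISO-like datetime string")
    time_fmt = "%H:%M:%S" if match.group("seconds") else "%H:%M"
    return f"%Y-%m-%d{match.group('sep')}{time_fmt}"
```

Because the format is rebuilt from the matched groups, adding the minutes-only layout does not require another entry in a list of candidates.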

MarcoGorelli

wonderful, thanks @FBruzzesi ! looks good on green

let's make a release and update the plotly PR?

FBruzzesi · 2024-10-29T10:54:32Z

Something is off with some tests, I will take a look

FBruzzesi · 2024-10-29T11:04:20Z

@MarcoGorelli should be good to go now :)

MarcoGorelli · 2024-10-29T11:49:37Z

amazing, love this

feat: infer datetime format for pyarrow

3a26b96

FBruzzesi added the enhancement and high priority labels Oct 16, 2024

FBruzzesi and others added 4 commits October 17, 2024 12:48

merge main

d7d4712

Merge branch 'main' into feat/pyarrow-to-datetime-infer

bc5f854

[pre-commit.ci] auto fixes from pre-commit.com hooks

a473699

for more information, see https://pre-commit.ci

Merge branch 'feat/pyarrow-to-datetime-infer' of https://github.com/n…

5e40493

…arwhals-dev/narwhals into feat/pyarrow-to-datetime-infer

Merge branch 'main' into feat/pyarrow-to-datetime-infer

87449b5

LiamConnors reviewed Oct 25, 2024

narwhals/_arrow/utils.py (resolved)

FBruzzesi added 2 commits October 29, 2024 09:19

Merge branch 'main' into feat/pyarrow-to-datetime-infer

7245e0f

fix for date format

260ea69

FBruzzesi marked this pull request as ready for review October 29, 2024 09:15

MarcoGorelli reviewed Oct 29, 2024

use first 10 non null values only to infer format

7d73981

MarcoGorelli approved these changes Oct 29, 2024

test with null

4fc9c94

MarcoGorelli merged commit f349cb2 into main Oct 29, 2024
25 checks passed

FBruzzesi deleted the feat/pyarrow-to-datetime-infer branch October 29, 2024 12:45

FBruzzesi mentioned this pull request Oct 29, 2024

[Enh]: Increase support for datetime format in pyarrow automatic inference #1282

Closed

2 tasks
