RFC, feat: infer datetime format for pyarrow backend #1195
ooh, nice! I think this might already be a good start? if we can at least auto-detect iso8601-like formats, that's already an improvement |
thanks for doing this!
If I understand correctly, this looks a bit too expensive - we're checking whether each element matches a given format, trying this for multiple formats, and then picking one
This is more expensive than what pandas does, which is to infer the format from the first non-null element and then use that for all elements
Could we just do that instead?
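To make the suggestion concrete, here is a minimal sketch of inferring the format from the first non-null element only. It uses plain Python and `datetime.strptime` rather than pyarrow, and the `CANDIDATE_FORMATS` tuple and the `infer_format_from_first` helper are hypothetical names for illustration; this is not pandas' or Narwhals' actual implementation.

```python
from datetime import datetime

# Hypothetical candidate list, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def infer_format_from_first(values):
    """Return the first candidate format that parses the first non-null value."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        return None  # nothing to infer from
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(first, fmt)
        except ValueError:
            continue  # this candidate does not fit, try the next one
        return fmt
    return None  # no candidate matched
```

The cost is independent of the length of the input, since only one element is ever inspected; the trade-off is that a mismatch later in the column only surfaces when the inferred format is applied.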
just tried timing this, and noticed a couple of things:
In [27]: s = pd.Series(pd.date_range('2000', periods=100_000, freq='h')).dt.strftime('%Y-%m-%dT%H:%M:%S')
In [28]: s_pa = pa.chunked_array([s])
In [29]: %time nw.from_native(s_pa, series_only=True).str.to_datetime(format='%Y-%m-%dT%H:%M:%S').to_native()
CPU times: user 5.89 ms, sys: 0 ns, total: 5.89 ms
Wall time: 5.94 ms
Out[29]:
<pyarrow.lib.ChunkedArray object at 0x7fe623eafe80>
[
[
2000-01-01 00:00:00.000000,
2000-01-01 01:00:00.000000,
2000-01-01 02:00:00.000000,
2000-01-01 03:00:00.000000,
2000-01-01 04:00:00.000000,
...
2011-05-29 11:00:00.000000,
2011-05-29 12:00:00.000000,
2011-05-29 13:00:00.000000,
2011-05-29 14:00:00.000000,
2011-05-29 15:00:00.000000
]
]
In [30]: %time nw.from_native(s_pa, series_only=True).str.to_datetime().to_native()
CPU times: user 111 ms, sys: 9.42 ms, total: 121 ms
Wall time: 130 ms
Out[30]:
<pyarrow.lib.ChunkedArray object at 0x7fe62865ca00>
[
[
2000-01-01 00:00:00.000000,
2000-01-01 01:00:00.000000,
2000-01-01 02:00:00.000000,
2000-01-01 03:00:00.000000,
2000-01-01 04:00:00.000000,
...
2011-05-29 11:00:00.000000,
2011-05-29 12:00:00.000000,
2011-05-29 13:00:00.000000,
2011-05-29 14:00:00.000000,
2011-05-29 15:00:00.000000
]
] |
If I take a slice of the first 10 elements, performance doesn't seem to be impacted - clearly 10 is an arbitrary number I picked for the test.
Sure, we can extend the list of supported formats, I just wanted to put the PR out to make sure that the approach is reasonable |
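A rough sketch of the sample-based variant, in plain Python rather than pyarrow: candidate patterns are checked only against a small sample, so the cost no longer grows with the length of the column. The `PATTERNS` mapping and the `infer_format_from_sample` helper are hypothetical, and sampling the first 10 non-null values (rather than the first 10 elements) is an assumption of this sketch.

```python
import re
from itertools import islice

# Hypothetical format -> regex mapping, for illustration only.
PATTERNS = {
    "%Y-%m-%dT%H:%M:%S": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
    "%Y-%m-%d %H:%M:%S": re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
    "%Y-%m-%d": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def infer_format_from_sample(values, sample_size=10):
    """Return the first format whose pattern fully matches every sampled value."""
    non_null = (v for v in values if v is not None)
    sample = list(islice(non_null, sample_size))
    if not sample:
        return None  # nothing to infer from
    for fmt, pattern in PATTERNS.items():
        if all(pattern.fullmatch(v) for v in sample):
            return fmt
    return None  # no single candidate fits the whole sample
```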
sure, first 10 seems fine, thanks! and we can leave |
Something is off with some tests, I will take a look |
@MarcoGorelli should be good to go now :) |
amazing, love this |
What type of PR is this? (check all applicable)
Related issues
to_datetime(format=None) case #1151
Checklist
If you have comments or can explain your changes, please do so below.
As discussed in the comment of the PR tagged in the issue, the implementation is inspired by and adjusted from the following repository: datetime-format.
I am opening this as a draft and looking for help with:
Regarding the latter two questions: at work we sometimes use formats with no separators, e.g.
YYYYmmddHHMMSS
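For reference, a separator-free layout like this can be written as a strptime-style format string. The snippet below is only an illustration with the standard library; the `COMPACT_FORMAT` name and the sample value are made up, and it says nothing about whether the PR supports this format.

```python
from datetime import datetime

# strptime-style equivalent of YYYYmmddHHMMSS (no separators)
COMPACT_FORMAT = "%Y%m%d%H%M%S"

# Made-up sample value: 1 January 2000, 13:30:00
parsed = datetime.strptime("20000101133000", COMPACT_FORMAT)
```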
PS: the CI failure on the doctest is unrelated