-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(decode_bytes
): always backslashreplace when asked to
#742
Conversation
The previous implementation enabled error handling in decoding only for the segment of a bytestring that an exception was raised for. However, it may well be that more decoding errors exist in other parts of the bytestring. I have a complicated real-world case where this happens, i.e. raising `UnicodeDecodeError` again, even though `decode_bytes` was called with `backslash_replace=True`. Unfortunately the data is so large that I did not manage to catch the condition exactly. It seems to be a needless sophistication to decode some part of the bytestring with error handling, but not another. There is a good chance that this patch is badly interacting with the logic to obtain the next chunk before attempting a decoding again.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #742 +/- ##
==========================================
+ Coverage 92.47% 92.51% +0.03%
==========================================
Files 195 195
Lines 14301 14301
Branches 2162 2162
==========================================
+ Hits 13225 13230 +5
+ Misses 812 806 -6
- Partials 264 265 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @mih for catching that. The fix needs a change IMHO.
return ( | ||
position + exc.end, | ||
joined_data[:position + exc.start].decode(encoding) | ||
+ joined_data[position + exc.start:position + exc.end].decode( | ||
encoding, | ||
errors='backslashreplace' | ||
), | ||
joined_data[:position + exc.end].decode( | ||
encoding, errors='backslashreplace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this does not work. The code duplicates input data, as in the following example:
>>> tuple(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3'])
('08 War \\xaf', '08 War \\xaf No \\xaf', '08 War \\xaf No \\xaf More \\xaf', ' Trouble.shn.mp3')
I think the return statement should be (a missing position
index is added):
return (
position + exc.end,
joined_data[position:position + exc.start].decode(encoding)
+ joined_data[position + exc.start:position + exc.end].decode(
encoding,
errors='backslashreplace'
),
)
We should also add a test for proper handling of multiple errors in a single input chunk. For example def test_multiple_errors():
r = ''.join(decode_bytes([b'08 War \xaf No \xaf More \xaf Trouble.shn.mp3']))
assert r == '08 War \\xaf No \\xaf More \\xaf Trouble.shn.mp3' |
Thanks @christian-monch ! Can you make the changes directly, and push into this PR. You have a better understanding of the logic and a fear that more iterations are needed, if I do it. Thanks! |
I just saw that this code has already been migrated to |
An alternative to fixing it here, is to fix it in |
This commit fixes an issue in multiple error handling where parts of the input strings were repeated in the output of `decode_bytes`. It also adds a regreesion test to enure that multiple encoding errors in a single input chunk are handled properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Also merged into datasalad!
The previous implementation enabled error handling in decoding only for the segment of a bytestring that an exception was raised for.
However, it may well be that more decoding errors exist in other parts of the bytestring. I have a complicated real-world case where this happens, i.e. raising
UnicodeDecodeError
again, even thoughdecode_bytes
was called withbackslash_replace=True
. Unfortunately the data is so large that I did not manage to catch the condition exactly.It seems to be a needless sophistication to decode some part of the bytestring with error handling, but not another.
There is a good chance that this patch is badly interacting with the logic to obtain the next chunk before attempting a decoding again.