-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(KeyValuePairLogEvent): Extend to_dict()
to support custom encoding and error handling for UnicodeDecodeError
(resolves #129).
#131
Conversation
WalkthroughThis pull request introduces enhancements to the Changes
Possibly related PRs
Suggested reviewers
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_ir/test_key_value_pair_log_event.py (1)
69-71
: UseassertIsNotNone
for clarity in assertionsConsider using
self.assertIsNotNone(log_event)
instead ofself.assertNotEqual(log_event, None)
andassert log_event is not None
for consistency and clarity within theunittest
framework.Apply this change:
- self.assertNotEqual(log_event, None) - assert log_event is not None + self.assertIsNotNone(log_event)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
clp_ffi_py/ir/native.pyi
(2 hunks)src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp
(5 hunks)src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp
(3 hunks)tests/test_ir/test_key_value_pair_log_event.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:42-44
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In the C++ codebase, for consistency with other classes, methods like `get_kv_pair_log_event()` in `src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp` should return pointers instead of references.
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: Build cp310-win_amd64
- GitHub Check: Build cp312-manylinux_aarch64
- GitHub Check: Build cp311-musllinux_x86_64
- GitHub Check: Build cp311-musllinux_i686
- GitHub Check: Build cp311-musllinux_aarch64
- GitHub Check: Build cp311-manylinux_x86_64
- GitHub Check: Build cp311-manylinux_i686
- GitHub Check: Build cp311-manylinux_aarch64
- GitHub Check: Build cp310-musllinux_x86_64
- GitHub Check: Build cp310-musllinux_i686
- GitHub Check: Build cp310-musllinux_aarch64
- GitHub Check: Build cp310-manylinux_x86_64
- GitHub Check: Build cp310-manylinux_i686
- GitHub Check: Build cp310-manylinux_aarch64
- GitHub Check: Build cp311-macosx_arm64
- GitHub Check: Build cp310-macosx_x86_64
- GitHub Check: Build cp310-macosx_universal2
- GitHub Check: Build cp310-macosx_arm64
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (11)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (3)
112-126
: Documentation is clear and comprehensiveThe updated docstring for the
to_dict
method accurately describes the purpose, parameters, and return value, providing helpful references to Python's standard encodings and error handlers. The use of Canadian English spelling in words like "behaviours" maintains consistency.
244-302
: Optimized handling of encoding parametersThe implementation of
PyKeyValuePairLogEvent_to_dict
efficiently checks whether the default encoding and error parameters are used, selecting the appropriate Unicode decoding function accordingly. This approach minimizes performance overhead when default parameters are provided, aligning with the PR objective of maintaining optimal runtime performance.
Line range hint
422-434
: Proper encapsulation within internal namespaceMoving the
decode_as_encoded_text_ast
function into thePyKeyValuePairLogEvent_internal
namespace enhances code organization and encapsulation, clearly indicating its intended internal use within the module.src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (4)
6-14
: Necessary header inclusions for template supportThe addition of headers such as
<concepts>
,<optional>
, and<string_view>
is appropriate to support the new templated methods and concepts introduced in the code.
27-43
: Well-defined concept for string view to Unicode conversionThe
StringViewToPyUnicodeMethodReq
concept is correctly defined to enforce that the provided method converts astd::string_view
to aPyObject*
. This enhances type safety and ensures that theto_dict
method operates with the expected interface.
121-131
: Template-basedto_dict
method improves flexibilityUpdating the
to_dict
method to be templated allows for customizable string decoding methods, providing greater flexibility while maintaining performance. This aligns with the PR's goal of supporting different encodings and error handling strategies.
129-131
: Methodget_kv_pair_log_event
returns pointer as per codebase conventionThe
get_kv_pair_log_event
method now returns a pointer instead of a reference, aligning with the project's coding conventions and the learnings from previous work.tests/test_ir/test_key_value_pair_log_event.py (2)
1-3
: Appropriate import additions for testingThe imports of
BytesIO
,Deserializer
, andSerializer
are necessary to support the new test cases added, ensuring that the test environment is properly set up.
46-86
: Comprehensive test for invalid UTF-8 encoding handlingThe
test_invalid_utf8_encoding
method effectively verifies the handling of invalid UTF-8 encoded strings with various encodings and error handling strategies. It ensures that exceptions are raised appropriately and that theto_dict
method can successfully decode strings when correct parameters are provided.clp_ffi_py/ir/native.pyi (2)
5-5
: LGTM! Import statement is properly organized.The addition of
Tuple
to the typing imports is appropriate and necessary for the return type annotation.
90-92
: Verify the encoding parameter values match Python's supported encodings.The method signature looks good and follows Python's string encoding conventions. Let's verify that the default values are supported.
✅ Verification successful
The encoding parameters are valid and well-supported
The default values
encoding="utf-8"
anderrors="strict"
are confirmed to be supported by Python's standard library and follow best practices for string encoding.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that the default encoding and error handler values are supported by Python python3 -c ' import codecs # Check if utf-8 encoding is supported try: codecs.lookup("utf-8") print("✓ utf-8 encoding is supported") except LookupError: print("✗ utf-8 encoding is not supported") # Check if strict error handler is supported try: "test".encode("utf-8", errors="strict") print("✓ strict error handler is supported") except LookupError: print("✗ strict error handler is not supported") # List all supported error handlers for reference print("\nSupported error handlers:") print(codecs.lookup_error("strict").__class__.__name__) for name in ["strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace"]: try: handler = codecs.lookup_error(name) print(f"- {name}") except LookupError: continue 'Length of output: 996
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic and Python interfaces lgtm. let's fix some nits in the docs
" to those in Python's `str` objects, for converting C strings into Python Unicode" | ||
" objects.\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
" to those in Python's `str` objects, for converting C strings into Python Unicode" | |
" objects.\n\n" | |
" to those in Python's `bytes.decode()` method, for converting C strings into Python" | |
" Unicode objects.\n\n" |
or
" to those in Python's `str` objects, for converting C strings into Python Unicode" | |
" objects.\n\n" | |
" to those in Python's `codecs.decode()` method, for converting C strings into Python" | |
" Unicode objects.\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referred to str
since PyUnicode_Decode
does so: https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_Decode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. For better clarity, can we say "Python's str()
built-in function" instead of "Python's str
objects"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk make sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied some rewording to make it as close to PyUnicode_Decode
's documentation as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about
feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling (resolves #129).
Co-authored-by: Junhao Liao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/test_ir/test_key_value_pair_log_event.py (3)
52-57
: Consider adding more descriptive test data documentation.The comment about msgpack map could be more detailed to explain why these specific byte values were chosen and what they represent in cp932 encoding.
- # msgpack map: {"key": 0x970x5c}, where "0x970x5c" is encoded using "cp932" + # msgpack map: {"key": 0x970x5c} + # Note: The byte sequence 0x970x5c represents the character '\' (fullwidth reverse solidus) + # when decoded using cp932 encoding, but is invalid in UTF-8
69-70
: Consider using more idiomatic assertions.The combination of
assertNotEqual
andassert
could be replaced with a single assertion.- self.assertNotEqual(log_event, None) - assert log_event is not None + self.assertIsNotNone(log_event, "Failed to deserialize log event")
46-86
: Consider adding more test cases for comprehensive coverage.The test covers basic scenarios but could be enhanced with:
- Different error handlers (e.g., "replace", "backslashreplace")
- Mixed valid/invalid UTF-8 sequences
- Edge cases like empty strings or very long strings
Would you like me to help generate additional test cases?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_ir/test_key_value_pair_log_event.py
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: Build cp311-win_amd64
- GitHub Check: Build cp310-win_amd64
- GitHub Check: Build cp311-musllinux_x86_64
- GitHub Check: Build cp311-musllinux_i686
- GitHub Check: Build cp311-musllinux_aarch64
- GitHub Check: Build cp311-manylinux_x86_64
- GitHub Check: Build cp311-manylinux_i686
- GitHub Check: Build cp311-manylinux_aarch64
- GitHub Check: Build cp310-musllinux_x86_64
- GitHub Check: Build cp310-musllinux_i686
- GitHub Check: Build cp310-musllinux_aarch64
- GitHub Check: Build cp310-manylinux_x86_64
- GitHub Check: Build cp310-manylinux_i686
- GitHub Check: Build cp310-manylinux_aarch64
- GitHub Check: Build cp311-macosx_arm64
- GitHub Check: Build cp310-macosx_x86_64
- GitHub Check: Build cp310-macosx_universal2
- GitHub Check: Build cp310-macosx_arm64
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
tests/test_ir/test_key_value_pair_log_event.py (1)
1-8
: LGTM! Import changes are well-organized.The new imports are properly categorized and all are utilized in the new test implementation.
to_dict
(resolves #129).to_dict()
to support custom Unicode encoding and error handling (resolves #129).
to_dict()
to support custom Unicode encoding and error handling (resolves #129).to_dict()
to support custom encoding and error handling for Unicode (resolves #129).
Add "for Unicode" at the end to clarify what is the encoding target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (2)
112-127
: Refine docstring for clarity and adopt Canadian spelling.These lines contain minor grammatical issues and use "behaviors" instead of "behaviours." Also, "build-in function" should be "built-in function," and consider adding "to" before "similar." For example:
- "The `encoding` and `errors` parameters allow customization of Unicode behaviors, similar" - " Python's `str` build-in function, for converting C strings into Python Unicode" + "The `encoding` and `errors` parameters allow customisation of Unicode behaviours, similar to" + " Python's built-in `str` function, for converting C strings into Python Unicode"
244-302
: Consider normalising encoding checks to handle case insensitivity.Right now, the code treats “utf-8” differently from “UTF-8” or “utf8”. If a user passes a variation like “UTF8”, the code path will differ from the intended default path and potentially suffer a performance penalty. Also confirm if you want to accept synonyms for “utf-8” by normalising or ignoring case, if that is consistent with the rest of your library’s design.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: Build cp311-win_amd64
- GitHub Check: Build cp310-win_amd64
- GitHub Check: Build cp311-musllinux_x86_64
- GitHub Check: Build cp311-musllinux_i686
- GitHub Check: Build cp311-musllinux_aarch64
- GitHub Check: Build cp311-manylinux_x86_64
- GitHub Check: Build cp311-manylinux_i686
- GitHub Check: Build cp311-manylinux_aarch64
- GitHub Check: Build cp310-musllinux_x86_64
- GitHub Check: Build cp310-musllinux_i686
- GitHub Check: Build cp310-musllinux_aarch64
- GitHub Check: Build cp310-manylinux_x86_64
- GitHub Check: Build cp310-manylinux_i686
- GitHub Check: Build cp310-manylinux_aarch64
- GitHub Check: Build cp311-macosx_arm64
- GitHub Check: Build cp310-macosx_x86_64
- GitHub Check: Build cp310-macosx_universal2
- GitHub Check: Build cp310-macosx_arm64
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (5)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (5)
132-134
: Forward declaration aligns with docstring and usage.The method signature matches the documented parameters and return type. No issues to address here.
145-145
: Method table registration looks correct.Using
METH_VARARGS | METH_KEYWORDS
is appropriate for a function with optional parameters.
420-420
: Namespace closing brace verified.No concerns here.
422-422
: Namespace introduced for internal helper functions.This nicely isolates internal logic while keeping the main namespace clean.
434-434
: Closing brace confirmed.No issues found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
115-126
: Grammar refinement
Please replace “allowing users to custom the behaviour” with “allowing users to customize the behaviour.”- name in Python's `str()` built-in function, allowing users to custom the behaviour of + name in Python's `str()` built-in function, allowing users to customize the behaviour of
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
⏰ Context from checks skipped due to timeout of 90000ms (20)
- GitHub Check: Build cp311-win_amd64
- GitHub Check: Build cp310-win_amd64
- GitHub Check: Build cp311-musllinux_x86_64
- GitHub Check: Build cp311-musllinux_i686
- GitHub Check: Build cp311-musllinux_aarch64
- GitHub Check: Build cp311-manylinux_x86_64
- GitHub Check: Build cp311-manylinux_i686
- GitHub Check: Build cp311-manylinux_aarch64
- GitHub Check: Build cp310-musllinux_x86_64
- GitHub Check: Build cp310-musllinux_i686
- GitHub Check: Build cp310-musllinux_aarch64
- GitHub Check: Build cp310-manylinux_x86_64
- GitHub Check: Build cp310-manylinux_i686
- GitHub Check: Build cp310-manylinux_aarch64
- GitHub Check: Build cp311-macosx_arm64
- GitHub Check: Build cp310-macosx_x86_64
- GitHub Check: Build cp310-macosx_universal2
- GitHub Check: Build cp310-macosx_arm64
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (15)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (15)
112-112
: Docstring reflects new parameters accurately
This update aligns with the new parameter signature forto_dict
.
145-145
: Method definition flags look good
Switching toMETH_VARARGS | METH_KEYWORDS
is consistent with accepting keyword arguments.
244-245
: Implementation matches the declaration
The method definition aligns with the updated docstring and signature.
246-246
: Return type is clear
The return typePyObject*
is appropriate for CPython API usage.
247-248
: Keyword parameters named well
The variableskeyword_encoding
andkeyword_errors
are clear and consistent with the documentation.
249-253
: Keyword table structure is correct
Defining the table for argument parsing is set up properly.
255-256
: Constants for default encoding and errors
Usingconstexpr std::string_view
improves clarity and avoids magic string literals.
258-262
: Default parameter initialization
Assigning defaults forencoding_c_str
anderrors_c_str
follows good practice.
263-276
: Robust argument parsing
The implementation correctly parses optional parameters for encoding and errors, returning early if parsing fails.
277-286
: Selective handling for non-default encoding
UsingPyUnicode_Decode
with custom encoding and error handling ensures flexibility for client code.
288-298
: Conditional branch for error handling
Distinguishing between custom error handlers with a default encoding is a well-structured approach, relying onPyUnicode_DecodeUTF8
.
299-302
: Efficient fallback for defaults
Falling back toPyUnicode_FromStringAndSize
for the common default scenario avoids extra overhead.
420-420
: No issues
The unnamed namespace closure is consistent with typical C++ usage.
Line range hint
422-434
: Internal namespace usage
Encapsulating helper logic withinPyKeyValuePairLogEvent_internal
keeps the interface clean.
132-133
: Ensure all function calls are updated to pass new arguments
This method signature now expectsargs
andkeywords
. Please confirm that calling sites throughout the codebase are up-to-date.
I had some misconceptions about how Python uses Unicode as the encoding for all
|
Yeah, in Python3 |
to_dict()
to support custom encoding and error handling for Unicode (resolves #129).to_dict()
to support custom encoding and error handling for UnicodeDecodeError
(resolves #129).
Description
This PR implements #129.
In this PR, we extend the current
KeyValueLogEvent.to_dict
to receive standard Python unicode encoding types and error handlers from users, which helps decode strings with invalid UTF-8 sequences. The method mirrors Python's str object with encoding and errors. The underlying CPython implementation will use PyUnicode_decode to apply the given encoding standard and error handler.However, we realize that if we replace the current
PyUnicode_FromStringAndSize
withPyUnicode_Decode
, the runtime of the private micro-benchmark will increase by around 7-8% to deserialize an IR stream. This is becausePyUnicode_Decode
method will involve more underlying string compare operations on the givenerrors
andencoding
to determine which decoding algorithm to use. This check is needed for everyPyUnicode_Decode
call when converting a C++ string to a Python Unicode object.To solve this issue, we templated
to_dict
's CPython implementation to receive different string-to-py-unicode methods. This allows us to checkencoding
anderrors
at the beginning and decide whether we can usePyUnicode_FromStringAndSize
to avoid the overhead ofPyUnicode_Decode
when the giving arguments are the default. This helps to reduce the runtime regression if users don't specify any errors or encoding types different to the default ones.Due to the fact that we need to template
to_dict
's C++ implementation, we move some methods from the source file to the header, which include:PyKeyValuePairLogEvent::to_dict
(template function)serialize_node_id_value_pair_to_py_dict
(template function, used byPyKeyValuePairLogEvent::to_dict
)insert_kv_pair_into_py_dict
(template function, used byserialize_node_id_value_pair_to_py_dict
)PyDictSerializationIterator
(class, used byserialize_node_id_value_pair_to_py_dict
)Notice that
PyKeyValuePairLogEvent_internal
namespace is used for internal helpers, as discussed in our latest coding guideline here.Validation performed
"ignore"
error handlerSummary by CodeRabbit
Release Notes
New Features
to_dict()
method with customizable string encoding options.Improvements
Testing
These updates provide greater control over string encoding when working with log events, allowing users to specify custom encoding and error-handling strategies.