Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(KeyValuePairLogEvent): Extend to_dict() to support custom encoding and error handling for UnicodeDecodeError (resolves #129). #131

Merged
merged 14 commits into from
Jan 24, 2025

Conversation

LinZhihao-723
Copy link
Member

@LinZhihao-723 LinZhihao-723 commented Jan 23, 2025

Description

This PR implements #129.
In this PR, we extend the current KeyValueLogEvent.to_dict to receive standard Python unicode encoding types and error handlers from users, which helps decode strings with invalid UTF-8 sequences. The method mirrors Python's str object with encoding and errors. The underlying CPython implementation will use PyUnicode_decode to apply the given encoding standard and error handler.

However, we realize that if we replace the current PyUnicode_FromStringAndSize with PyUnicode_Decode, the runtime of the private micro-benchmark will increase by around 7-8% to deserialize an IR stream. This is because PyUnicode_Decode method will involve more underlying string compare operations on the given errors and encoding to determine which decoding algorithm to use. This check is needed for every PyUnicode_Decode call when converting a C++ string to a Python Unicode object.

To solve this issue, we templated to_dict's CPython implementation to receive different string-to-py-unicode methods. This allows us to check encoding and errors at the beginning and decide whether we can use PyUnicode_FromStringAndSize to avoid the overhead of PyUnicode_Decode when the giving arguments are the default. This helps to reduce the runtime regression if users don't specify any errors or encoding types different to the default ones.

Due to the fact that we need to template to_dict's C++ implementation, we move some methods from the source file to the header, which include:

  • PyKeyValuePairLogEvent::to_dict (template function)
  • serialize_node_id_value_pair_to_py_dict (template function, used by PyKeyValuePairLogEvent::to_dict)
  • insert_kv_pair_into_py_dict (template function, used by serialize_node_id_value_pair_to_py_dict)
  • PyDictSerializationIterator (class, used by serialize_node_id_value_pair_to_py_dict)

Notice that PyKeyValuePairLogEvent_internal namespace is used for internal helpers, as discussed in our latest coding guideline here.

Validation performed

  • Ensure all workflows passed.
  • Ensure the API doc can be successfully compiled and rendered.
  • Add new unit test case to:
    • Serialize/deserialize IR that contains non-UTF8 strings
    • Ensure a log event with such a string:
      • Can't be converted to Python dictionary with the default encoding and errors setting
      • Can't be converted to Python dictionary with the wrong encoding type
      • Can be successfully converted to Python dictionary with the proper encoding type (cp932)
      • Can be successfully converted to Python dictionary with "ignore" error handler

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced to_dict() method with customizable string encoding options.
    • Added support for specifying encoding and error handling when converting log events to dictionaries.
  • Improvements

    • Improved flexibility in handling Unicode string conversions.
    • More robust error management during dictionary serialization.
  • Testing

    • Added comprehensive test coverage for invalid UTF-8 encoding scenarios.

These updates provide greater control over string encoding when working with log events, allowing users to specify custom encoding and error-handling strategies.

Copy link

coderabbitai bot commented Jan 23, 2025

Walkthrough

This pull request introduces enhancements to the to_dict() method in the KeyValuePairLogEvent class, modifying its signature to include optional encoding and errors parameters for more flexible string decoding during dictionary conversion. The changes span multiple files, including updates to the type stub, C++ implementation, and header files to support customizable Unicode string handling. Additionally, a new test method is added to validate the handling of invalid UTF-8 encoded strings.

Changes

File Change Summary
clp_ffi_py/ir/native.pyi Updated to_dict() method signature to include encoding and errors parameters with default values.
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp Removed PyDictSerializationIterator and updated to_dict() method to handle new encoding parameters. Modified method signature to accept variable arguments.
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp Added new concept StringViewToPyUnicodeMethodReq, modified to_dict() method signature, introduced new serialization classes and methods for key-value pair handling.
tests/test_ir/test_key_value_pair_log_event.py Added new test method test_invalid_utf8_encoding to validate encoding handling and updated import statements.

Possibly related PRs

Suggested reviewers

  • haiqi96
✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_ir/test_key_value_pair_log_event.py (1)

69-71: Use assertIsNotNone for clarity in assertions

Consider using self.assertIsNotNone(log_event) instead of self.assertNotEqual(log_event, None) and assert log_event is not None for consistency and clarity within the unittest framework.

Apply this change:

- self.assertNotEqual(log_event, None)
- assert log_event is not None
+ self.assertIsNotNone(log_event)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 034dfa2 and bfcb7e9.

📒 Files selected for processing (4)
  • clp_ffi_py/ir/native.pyi (2 hunks)
  • src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (5 hunks)
  • src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (3 hunks)
  • tests/test_ir/test_key_value_pair_log_event.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:42-44
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In the C++ codebase, for consistency with other classes, methods like `get_kv_pair_log_event()` in `src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp` should return pointers instead of references.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp312-manylinux_aarch64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (11)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (3)

112-126: Documentation is clear and comprehensive

The updated docstring for the to_dict method accurately describes the purpose, parameters, and return value, providing helpful references to Python's standard encodings and error handlers. The use of Canadian English spelling in words like "behaviours" maintains consistency.


244-302: Optimized handling of encoding parameters

The implementation of PyKeyValuePairLogEvent_to_dict efficiently checks whether the default encoding and error parameters are used, selecting the appropriate Unicode decoding function accordingly. This approach minimizes performance overhead when default parameters are provided, aligning with the PR objective of maintaining optimal runtime performance.


Line range hint 422-434: Proper encapsulation within internal namespace

Moving the decode_as_encoded_text_ast function into the PyKeyValuePairLogEvent_internal namespace enhances code organization and encapsulation, clearly indicating its intended internal use within the module.

src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp (4)

6-14: Necessary header inclusions for template support

The addition of headers such as <concepts>, <optional>, and <string_view> is appropriate to support the new templated methods and concepts introduced in the code.


27-43: Well-defined concept for string view to Unicode conversion

The StringViewToPyUnicodeMethodReq concept is correctly defined to enforce that the provided method converts a std::string_view to a PyObject*. This enhances type safety and ensures that the to_dict method operates with the expected interface.


121-131: Template-based to_dict method improves flexibility

Updating the to_dict method to be templated allows for customizable string decoding methods, providing greater flexibility while maintaining performance. This aligns with the PR's goal of supporting different encodings and error handling strategies.


129-131: Method get_kv_pair_log_event returns pointer as per codebase convention

The get_kv_pair_log_event method now returns a pointer instead of a reference, aligning with the project's coding conventions and the learnings from previous work.

tests/test_ir/test_key_value_pair_log_event.py (2)

1-3: Appropriate import additions for testing

The imports of BytesIO, Deserializer, and Serializer are necessary to support the new test cases added, ensuring that the test environment is properly set up.


46-86: Comprehensive test for invalid UTF-8 encoding handling

The test_invalid_utf8_encoding method effectively verifies the handling of invalid UTF-8 encoded strings with various encodings and error handling strategies. It ensures that exceptions are raised appropriately and that the to_dict method can successfully decode strings when correct parameters are provided.

clp_ffi_py/ir/native.pyi (2)

5-5: LGTM! Import statement is properly organized.

The addition of Tuple to the typing imports is appropriate and necessary for the return type annotation.


90-92: Verify the encoding parameter values match Python's supported encodings.

The method signature looks good and follows Python's string encoding conventions. Let's verify that the default values are supported.

✅ Verification successful

The encoding parameters are valid and well-supported

The default values encoding="utf-8" and errors="strict" are confirmed to be supported by Python's standard library and follow best practices for string encoding.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the default encoding and error handler values are supported by Python

python3 -c '
import codecs

# Check if utf-8 encoding is supported
try:
    codecs.lookup("utf-8")
    print("✓ utf-8 encoding is supported")
except LookupError:
    print("✗ utf-8 encoding is not supported")

# Check if strict error handler is supported
try:
    "test".encode("utf-8", errors="strict")
    print("✓ strict error handler is supported")
except LookupError:
    print("✗ strict error handler is not supported")

# List all supported error handlers for reference
print("\nSupported error handlers:")
print(codecs.lookup_error("strict").__class__.__name__)
for name in ["strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace", "namereplace"]:
    try:
        handler = codecs.lookup_error(name)
        print(f"- {name}")
    except LookupError:
        continue
'

Length of output: 996

Copy link
Member

@junhaoliao junhaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic and Python interfaces lgtm. let's fix some nits in the docs

Comment on lines 116 to 117
" to those in Python's `str` objects, for converting C strings into Python Unicode"
" objects.\n\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

Suggested change
" to those in Python's `str` objects, for converting C strings into Python Unicode"
" objects.\n\n"
" to those in Python's `bytes.decode()` method, for converting C strings into Python"
" Unicode objects.\n\n"

or

Suggested change
" to those in Python's `str` objects, for converting C strings into Python Unicode"
" objects.\n\n"
" to those in Python's `codecs.decode()` method, for converting C strings into Python"
" Unicode objects.\n\n"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referred to str since PyUnicode_Decode does so: https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_Decode

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. For better clarity, can we say "Python's str() built-in function" instead of "Python's str objects"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk make sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied some rewording to make it as close to PyUnicode_Decode's documentation as possible.

tests/test_ir/test_key_value_pair_log_event.py Outdated Show resolved Hide resolved
Copy link
Member

@junhaoliao junhaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling (resolves #129).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/test_ir/test_key_value_pair_log_event.py (3)

52-57: Consider adding more descriptive test data documentation.

The comment about msgpack map could be more detailed to explain why these specific byte values were chosen and what they represent in cp932 encoding.

-        # msgpack map: {"key": 0x970x5c}, where "0x970x5c" is encoded using "cp932"
+        # msgpack map: {"key": 0x970x5c}
+        # Note: The byte sequence 0x970x5c represents the character '\' (fullwidth reverse solidus)
+        # when decoded using cp932 encoding, but is invalid in UTF-8

69-70: Consider using more idiomatic assertions.

The combination of assertNotEqual and assert could be replaced with a single assertion.

-        self.assertNotEqual(log_event, None)
-        assert log_event is not None
+        self.assertIsNotNone(log_event, "Failed to deserialize log event")

46-86: Consider adding more test cases for comprehensive coverage.

The test covers basic scenarios but could be enhanced with:

  1. Different error handlers (e.g., "replace", "backslashreplace")
  2. Mixed valid/invalid UTF-8 sequences
  3. Edge cases like empty strings or very long strings

Would you like me to help generate additional test cases?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4428f0 and cee99cc.

📒 Files selected for processing (1)
  • tests/test_ir/test_key_value_pair_log_event.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp311-win_amd64
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
tests/test_ir/test_key_value_pair_log_event.py (1)

1-8: LGTM! Import changes are well-organized.

The new imports are properly categorized and all are utilized in the new test implementation.

@LinZhihao-723 LinZhihao-723 changed the title feat(KeyValuePairLogEvent): Add support for receiving standard Python unicode encoding types and error handlers from users to decode C++ strings into Python unicode objects in to_dict (resolves #129). feat(KeyValuePairLogEvent): Extend to_dict() to support custom Unicode encoding and error handling (resolves #129). Jan 23, 2025
@LinZhihao-723 LinZhihao-723 changed the title feat(KeyValuePairLogEvent): Extend to_dict() to support custom Unicode encoding and error handling (resolves #129). feat(KeyValuePairLogEvent): Extend to_dict() to support custom encoding and error handling for Unicode (resolves #129). Jan 23, 2025
@LinZhihao-723
Copy link
Member Author

For the PR title, how about

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling (resolves #129).

Add "for Unicode" at the end to clarify what is the encoding target

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (2)

112-127: Refine docstring for clarity and adopt Canadian spelling.

These lines contain minor grammatical issues and use "behaviors" instead of "behaviours." Also, "build-in function" should be "built-in function," and consider adding "to" before "similar." For example:

- "The `encoding` and `errors` parameters allow customization of Unicode behaviors, similar"
- " Python's `str` build-in function, for converting C strings into Python Unicode"
+ "The `encoding` and `errors` parameters allow customisation of Unicode behaviours, similar to"
+ " Python's built-in `str` function, for converting C strings into Python Unicode"

244-302: Consider normalising encoding checks to handle case insensitivity.

Right now, the code treats “utf-8” differently from “UTF-8” or “utf8”. If a user passes a variation like “UTF8”, the code path will differ from the intended default path and potentially suffer a performance penalty. Also confirm if you want to accept synonyms for “utf-8” by normalising or ignoring case, if that is consistent with the rest of your library’s design.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cee99cc and 9f35e34.

📒 Files selected for processing (1)
  • src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp311-win_amd64
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (5)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (5)

132-134: Forward declaration aligns with docstring and usage.

The method signature matches the documented parameters and return type. No issues to address here.


145-145: Method table registration looks correct.

Using METH_VARARGS | METH_KEYWORDS is appropriate for a function with optional parameters.


420-420: Namespace closing brace verified.

No concerns here.


422-422: Namespace introduced for internal helper functions.

This nicely isolates internal logic while keeping the main namespace clean.


434-434: Closing brace confirmed.

No issues found.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)

115-126: Grammar refinement
Please replace “allowing users to custom the behaviour” with “allowing users to customize the behaviour.”

- name in Python's `str()` built-in function, allowing users to custom the behaviour of
+ name in Python's `str()` built-in function, allowing users to customize the behaviour of
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f35e34 and 3f56801.

📒 Files selected for processing (1)
  • src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#81
File: src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.hpp:0-0
Timestamp: 2024-11-14T06:30:18.577Z
Learning: In `PyKeyValuePairLogEvent`, we assume invalid objects will never be created, so null pointer checks in methods like `get_kv_pair_log_event()` are unnecessary, assuming CPython system runs correctly.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp311-win_amd64
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (15)
src/clp_ffi_py/ir/native/PyKeyValuePairLogEvent.cpp (15)

112-112: Docstring reflects new parameters accurately
This update aligns with the new parameter signature for to_dict.


145-145: Method definition flags look good
Switching to METH_VARARGS | METH_KEYWORDS is consistent with accepting keyword arguments.


244-245: Implementation matches the declaration
The method definition aligns with the updated docstring and signature.


246-246: Return type is clear
The return type PyObject* is appropriate for CPython API usage.


247-248: Keyword parameters named well
The variables keyword_encoding and keyword_errors are clear and consistent with the documentation.


249-253: Keyword table structure is correct
Defining the table for argument parsing is set up properly.


255-256: Constants for default encoding and errors
Using constexpr std::string_view improves clarity and avoids magic string literals.


258-262: Default parameter initialization
Assigning defaults for encoding_c_str and errors_c_str follows good practice.


263-276: Robust argument parsing
The implementation correctly parses optional parameters for encoding and errors, returning early if parsing fails.


277-286: Selective handling for non-default encoding
Using PyUnicode_Decode with custom encoding and error handling ensures flexibility for client code.


288-298: Conditional branch for error handling
Distinguishing between custom error handlers with a default encoding is a well-structured approach, relying on PyUnicode_DecodeUTF8.


299-302: Efficient fallback for defaults
Falling back to PyUnicode_FromStringAndSize for the common default scenario avoids extra overhead.


420-420: No issues
The unnamed namespace closure is consistent with typical C++ usage.


Line range hint 422-434: Internal namespace usage
Encapsulating helper logic within PyKeyValuePairLogEvent_internal keeps the interface clean.


132-133: Ensure all function calls are updated to pass new arguments
This method signature now expects args and keywords. Please confirm that calling sites throughout the codebase are up-to-date.

@junhaoliao
Copy link
Member

junhaoliao commented Jan 24, 2025

For the PR title, how about

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling (resolves #129).

Add "for Unicode" at the end to clarify what is the encoding target

It seems Python 3 raises `UnicodeError` or its subclasses whenever some encoding / decoding error happens regardless whether the encoding is "Unicode" or not. e.g., ``` PyDev console: starting. Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32 "我爱学习".encode("gb2312").decode("ascii") Traceback (most recent call last): File "C:\Users\Junhao\AppData\Local\Programs\PyCharm Professional 2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode coro = func() File "", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128) ```

It's definitely Python being confusing in the first place.

I had some misconceptions about how Python uses Unicode as the encoding for all str objects.
Still, I think adding UnicodeError / UnicodeDecodeError might promote better clarity since we are providing ways to mitigate errors that happen during "bytes in arbitrary encoding" -> "Unicode".

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling for `UnicodeDecodeError` (resolves #129).

@LinZhihao-723
Copy link
Member Author

For the PR title, how about

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling (resolves #129).

Add "for Unicode" at the end to clarify what is the encoding target

It seems Python 3 raises UnicodeError or its subclasses whenever some encoding / decoding error happens regardless whether the encoding is "Unicode" or not. e.g., PyDev console: starting. Python 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] on win32 "我爱学习".encode("gb2312").decode("ascii") Traceback (most recent call last): File "C:\Users\Junhao\AppData\Local\Programs\PyCharm Professional 2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode coro = func() File "", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
It's definitely Python being confusing in the first place.

I had some misconceptions about how Python uses Unicode as the encoding for all str objects. Still, I think adding UnicodeError / UnicodeDecodeError might promote better clarity since we are providing ways to mitigate errors that happen during "bytes in arbitrary encoding" -> "Unicode".

feat(KeyValuePairLogEvent): Extend `to_dict()` to support custom encoding and error handling for `UnicodeDecodeError` (resolves #129).

Yeah, in Python3 str and unicode object are the same. So decode(encoding=xxx) is basically saying we decode a byte sequence using xxx encoding convention, and turn the results into a Unicode object.

@LinZhihao-723 LinZhihao-723 changed the title feat(KeyValuePairLogEvent): Extend to_dict() to support custom encoding and error handling for Unicode (resolves #129). feat(KeyValuePairLogEvent): Extend to_dict() to support custom encoding and error handling for UnicodeDecodeError (resolves #129). Jan 24, 2025
@LinZhihao-723 LinZhihao-723 merged commit 7e7eff4 into y-scope:main Jan 24, 2025
73 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants