feat: Add support for serializing/deserializing user-defined stream-level metadata. #133

Merged (9 commits) on Jan 26, 2025

Conversation

LinZhihao-723 (Member) commented on Jan 26, 2025

Description

This PR depends on #132 and #134.
This PR adds support for serializing/deserializing user-defined stream-level metadata. As discussed offline, we decided to constrain the metadata input to a JSON string to simplify type checking.
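As a rough illustration of that constraint, input is accepted only if it is a string that parses to a JSON object. The helper below is a hypothetical, pure-Python sketch of the check, not the PR's actual C++ implementation (which uses nlohmann::json on the native side):

```python
import json


def validate_user_defined_metadata(metadata_json: str) -> dict:
    """Sketch of the constraint: the input must be a `str` that encodes a JSON object."""
    if not isinstance(metadata_json, str):
        # Mirrors the PR's TypeError for non-string inputs.
        raise TypeError("user_defined_metadata must be given as a `str` object.")
    parsed = json.loads(metadata_json)
    if not isinstance(parsed, dict):
        # JSON arrays and primitives are rejected; only objects are allowed.
        raise TypeError("user_defined_metadata must encode a JSON object.")
    return parsed
```
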

Validation performed

  • Ensured all workflows passed.
  • Updated existing unit tests to ensure user-defined metadata can be successfully serialized and deserialized.

Summary by CodeRabbit

  • New Features

    • Added support for user-defined metadata in serialization and deserialization processes.
    • Introduced ability to pass and retrieve custom metadata during stream processing.
    • New function for serializing Python dictionaries to JSON strings.
  • Documentation

    • Updated method signatures and documentation to explain new metadata functionality.
  • Tests

    • Added test cases to validate user-defined metadata handling.
    • Implemented error checking for invalid metadata inputs.

The changes enhance the serialization framework by allowing users to attach and retrieve custom metadata during stream operations, providing more flexibility in data processing.


coderabbitai bot commented Jan 26, 2025

Walkthrough

This pull request introduces user-defined metadata functionality to the serialization and deserialization process. The changes span multiple files across the project, adding a new optional parameter user_defined_metadata to the Serializer class and a corresponding get_user_defined_metadata method to the Deserializer class. The implementation includes type checking, JSON serialization, and integration with existing serialization infrastructure, allowing users to attach custom metadata during the serialization process.
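From the walkthrough, the user-facing surface might look roughly like the stub below. This is a sketch reconstructed from the summary, not the actual contents of native.pyi; the parameter names follow the summary, and the 64 KiB default is taken from the docstring hunk quoted in this PR.

```python
# Hypothetical sketch of the API surface described in the walkthrough.
from typing import IO, Any, Dict, Optional


class Serializer:
    def __init__(
        self,
        output_stream: IO[bytes],
        buffer_size_limit: int = 64 * 1024,  # 64 KiB, per the quoted docstring
        user_defined_metadata: Optional[Dict[str, Any]] = None,
    ) -> None: ...


class Deserializer:
    def get_user_defined_metadata(self) -> Optional[Dict[str, Any]]: ...
```
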

Changes

File-by-file summary:

  • clp_ffi_py/ir/native.pyi: Updated the Serializer constructor with an optional user_defined_metadata parameter; added a get_user_defined_metadata method to Deserializer.
  • src/clp_ffi_py/ir/native/PyDeserializer.*: Added a get_user_defined_metadata method to retrieve stream-level metadata.
  • src/clp_ffi_py/ir/native/PySerializer.cpp: Modified __init__ to accept and process user_defined_metadata.
  • clp_ffi_py/utils.py: Added a serialize_dict_to_json_str utility function.
  • src/clp_ffi_py/Py_utils.*: Added JSON serialization support and related utility functions.
  • tests/test_ir/test_serder.py: Updated the test base class to include user-defined metadata testing.
  • tests/test_ir/test_serializer.py: Added a test for invalid user-defined metadata scenarios.
  • src/clp: Updated the subproject commit identifier.

Suggested Reviewers

  • junhaoliao

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e7eff4 and f49a098.

📒 Files selected for processing (6)
  • clp_ffi_py/ir/native.pyi (2 hunks)
  • src/clp (1 hunks)
  • src/clp_ffi_py/ir/native/PyDeserializer.cpp (5 hunks)
  • src/clp_ffi_py/ir/native/PyDeserializer.hpp (2 hunks)
  • src/clp_ffi_py/ir/native/PySerializer.cpp (6 hunks)
  • tests/test_ir/test_serder.py (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/clp
🧰 Additional context used
📓 Path-based instructions (3)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/ir/native/PyDeserializer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/ir/native/PySerializer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (3)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.hpp:246-248
Timestamp: 2024-11-23T05:54:37.767Z
Learning: In the `src/clp_ffi_py/ir/native` module, raw pointers with `gsl::owner` are intentionally used instead of `std::unique_ptr` to comply with CPython's memory management requirements.
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.cpp:218-242
Timestamp: 2024-11-23T05:55:57.106Z
Learning: In `src/clp_ffi_py/ir/native/PyDeserializer.cpp`, the `deserialize_log_event` method in the `PyDeserializer` class ensures proper loop termination and does not risk infinite loops, even when handling incomplete or corrupted streams.
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.cpp:206-208
Timestamp: 2024-11-23T05:55:08.019Z
Learning: In code interfacing with CPython (e.g., `src/clp_ffi_py/ir/native/PyDeserializer.cpp`), using raw pointers for memory management is acceptable due to CPython's memory management requirements.
⏰ Context from checks skipped due to timeout of 90000ms (18)
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (15)
clp_ffi_py/ir/native.pyi (2)

93-98: LGTM! The __init__ signature is well-defined.

The optional user_defined_metadata parameter is correctly typed and positioned with an appropriate default value.


121-121: LGTM! The method signature is well-defined.

The return type Optional[Dict[str, Any]] is appropriate for JSON-like metadata.

tests/test_ir/test_serder.py (4)

1-1: LGTM! Required import is added.

The json import is necessary for metadata serialization/deserialization.


20-35: LGTM! Comprehensive test data covering all JSON types.

The test data includes:

  • Primitive types (int, float, bool, string)
  • Special values (null)
  • Complex types (arrays, nested objects)

70-72: LGTM! Proper metadata serialization.

The metadata is correctly converted to a JSON string before being passed to the serializer.


124-127: LGTM! Proper metadata verification.

The test verifies that the deserialized metadata matches the original data using deep comparison.

src/clp_ffi_py/ir/native/PyDeserializer.hpp (2)

15-15: LGTM! Required headers are added.

The new headers support JSON handling and protocol constants for metadata keys.

Also applies to: 19-19


113-118: LGTM! Method declaration is well-defined.

The method:

  • Uses [[nodiscard]] to enforce return value handling
  • Returns a const pointer to prevent metadata modification
  • Has clear documentation
src/clp_ffi_py/ir/native/PyDeserializer.cpp (5)

6-6: LGTM! Required headers are added.

The headers provide necessary support for string handling, protocol constants, and JSON operations.

Also applies to: 14-14, 19-19


75-90: LGTM! Documentation follows Python conventions.

The docstring:

  • Clearly describes the functionality
  • Documents return types and values
  • Uses proper reST formatting

104-108: LGTM! Method is properly exposed to Python.

The method table entry correctly sets up the Python binding with appropriate flags and documentation.


183-223: LGTM! Robust Python method implementation.

The implementation:

  • Handles all error cases (type errors, serialization failures)
  • Properly converts JSON to Python dictionary
  • Follows Python reference counting rules

354-363: LGTM! Clean C++ implementation.

The implementation:

  • Uses protocol-defined constants for metadata key
  • Safely handles missing metadata case
  • Returns const pointer to prevent modification
src/clp_ffi_py/ir/native/PySerializer.cpp (2)

15-15: LGTM! Documentation and includes are well-structured.

The docstring changes properly document the new parameter, and the necessary JSON library is included for metadata parsing.

Also applies to: 36-48


319-319: LGTM! Proper handling of optional metadata.

The optional metadata is correctly passed to the serializer's create method.

Comment on lines 287 to 317
    std::optional<nlohmann::json> optional_user_defined_metadata;
    if (Py_None != user_defined_metadata) {
        if (false == static_cast<bool>(PyUnicode_CheckExact(user_defined_metadata))) {
            PyErr_Format(
                    PyExc_TypeError,
                    "`%s` must be given as a `str` object.",
                    static_cast<char const*>(keyword_user_defined_metadata)
            );
            return -1;
        }
        Py_ssize_t json_str_size{};
        auto const* json_str_data{PyUnicode_AsUTF8AndSize(user_defined_metadata, &json_str_size)};
        if (nullptr == json_str_data) {
            return -1;
        }
        auto parsed_user_defined_metadata = nlohmann::json::parse(
                std::string_view{json_str_data, static_cast<size_t>(json_str_size)},
                nullptr,
                false
        );
        if (parsed_user_defined_metadata.is_discarded()) {
            PyErr_Format(
                    PyExc_RuntimeError,
                    "Failed to parse `%s`: %s",
                    static_cast<char const*>(keyword_user_defined_metadata),
                    json_str_data
            );
            return -1;
        }
        optional_user_defined_metadata = std::move(parsed_user_defined_metadata);
    }
coderabbitai bot commented Jan 26, 2025

🛠️ Refactor suggestion

Add validation to ensure parsed metadata is a JSON object.

While the JSON parsing is handled well, there's no validation to ensure the parsed result is specifically a JSON object. Other JSON types (arrays, primitives) should be rejected.

Add this validation after line 306:

         if (parsed_user_defined_metadata.is_discarded()) {
             PyErr_Format(
                     PyExc_RuntimeError,
                     "Failed to parse `%s`: %s",
                     static_cast<char const*>(keyword_user_defined_metadata),
                     json_str_data
             );
             return -1;
         }
+        if (!parsed_user_defined_metadata.is_object()) {
+            PyErr_Format(
+                    PyExc_ValueError,
+                    "`%s` must be a JSON object, got %s",
+                    static_cast<char const*>(keyword_user_defined_metadata),
+                    parsed_user_defined_metadata.type_name()
+            );
+            return -1;
+        }
         optional_user_defined_metadata = std::move(parsed_user_defined_metadata);

junhaoliao (Member):

I agree with the rabbit if we're to accept a str type metadata.

coderabbitai bot:

@junhaoliao Thanks for confirming! I'll add this to our knowledge base to ensure we maintain this validation requirement for JSON metadata.


✏️ Learnings added
Learnt from: junhaoliao
PR: y-scope/clp-ffi-py#133
File: src/clp_ffi_py/ir/native/PySerializer.cpp:287-317
Timestamp: 2025-01-26T04:49:24.199Z
Learning: In clp-ffi-py, when accepting string input for JSON metadata, validation must be performed to ensure the parsed result is specifically a JSON object (not arrays or primitives) using the `is_object()` method from nlohmann::json.


LinZhihao-723 (Member, Author):

This type validation will be handled by the FFI core API when we use this parsed object to create a serializer.

coderabbitai bot:

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

":return:\n"
" - The deserialized user-defined stream-level metadata, loaded as a"
" dictionary.\n"
" - None if user-defined stream-level metadata was not given in the deserialized"
Member:

Would there be less handling for API users if we return an empty dictionary if "user-defined stream-level metadata was not given in the deserialized IR stream"?

Member:

Though, it does seem in the Serializer we do permit users to provide a None or "{}" value. I haven't traced down to the CLP code but I think the given metadata string will be encoded as-is?

LinZhihao-723 (Member, Author):

I would prefer to return None indicating that the user-defined metadata wasn't specified in the source.
It is possible to encode an empty dictionary as the metadata though.
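A small sketch of what this design choice means on the caller side (hypothetical helper, not part of the PR): None distinguishes "metadata absent from the stream" from an explicitly serialized empty object.

```python
def describe_metadata(metadata):
    # `None` means the stream carried no user-defined metadata at all;
    # `{}` means an empty JSON object was explicitly serialized.
    if metadata is None:
        return "absent"
    return f"present ({len(metadata)} keys)"
```
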

    if (false == user_defined_metadata->is_object()) {
        PyErr_SetString(
                PyExc_TypeError,
                "User-defined stream-level metadata is not a JSON object."
Member:

If the type of this JSON value is not an object, it means either the file was not created using a Serializer we provide, or something went wrong in our code, right?

Though I imagine such cases are rare, would it be more helpful for debugging if we included the JSON value type in the error message?

LinZhihao-723 (Member, Author):

Yeah, the only possible case is that someone created a stream that follows our format but doesn't enforce the use of a JSON object as the metadata. Technically we should check this in the ffi core. Let me think about it...

LinZhihao-723 (Member, Author):

As improved in y-scope/clp#695, we moved this check to the ffi core. The latest commit should have this check removed (however, we need to merge #134 first).

return nullptr;
}

PyObjectPtr<PyObject> py_metadata_dict{py_utils_parse_json_str(json_str)};
Member:

Sorry for not being able to find any previous discussions on this. We have considered https://github.com/pybind/pybind11_json and the other alternatives discussed at https://www.github.com/nlohmann/json/issues/1014, right?

Just curious: for more efficient conversions, do we have plans to make use of those or to create our own bindings?

LinZhihao-723 (Member, Author):

Since this isn't on any critical path anyway, I would probably defer it to a future release.

@@ -42,6 +43,9 @@ PyDoc_STRVAR(
":param buffer_size_limit: The maximum amount of serialized data to buffer before flushing"
" it to `output_stream`. Defaults to 64 KiB.\n"
":type buffer_size_limit: int\n"
":param user_defined_metadata: JSON-encoded string containing user-defined stream-level"
Member:

I think an easier-to-use interface would be to accept a dict type. If we're too time-constrained to implement that, that's fine.

LinZhihao-723 (Member, Author):

That's true. I think we can fix it with a Python API binding.

LinZhihao-723 (Member, Author):

Done.


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/clp_ffi_py/Py_utils.cpp (1)

135-158: Enhance the type error message.

The error message could be more helpful by including the actual type received.

Apply this diff to improve the error message:

     if (false == static_cast<bool>(PyUnicode_Check(result))) {
         PyErr_Format(
                 PyExc_TypeError,
-                "`%s` is supposed to return a `str` object",
-                cPyFuncNameSerializeDictToJsonStr
+                "`%s` is supposed to return a `str` object, got `%s` instead",
+                cPyFuncNameSerializeDictToJsonStr,
+                Py_TYPE(result)->tp_name
         );
         return nullptr;
     }
tests/test_ir/test_serializer.py (1)

169-175: Add more test cases for user-defined metadata.

While the current test cases verify error handling for invalid types, consider adding the following test cases for better coverage:

  1. Valid dictionary with simple values
  2. None value (default case)
  3. Dictionary with nested structures
  4. Dictionary with invalid JSON values (e.g., custom objects)

Here's a suggested implementation:

     def test_invalid_user_defined_metadata(self) -> None:
         byte_buffer: BytesIO = BytesIO()
         with self.assertRaises(TypeError):
             _ = Serializer(byte_buffer, user_defined_metadata="str")  # type: ignore
         with self.assertRaises(TypeError):
             _ = Serializer(byte_buffer, user_defined_metadata=[1, "str"])  # type: ignore
+
+    def test_valid_user_defined_metadata(self) -> None:
+        byte_buffer: BytesIO = BytesIO()
+        # Test None (default)
+        serializer = Serializer(byte_buffer)
+        serializer.close()
+
+        # Test valid dictionary with simple values
+        metadata = {"key": "value", "number": 42}
+        serializer = Serializer(byte_buffer, user_defined_metadata=metadata)
+        serializer.close()
+
+        # Test dictionary with nested structures
+        metadata = {"nested": {"key": "value"}, "array": [1, 2, 3]}
+        serializer = Serializer(byte_buffer, user_defined_metadata=metadata)
+        serializer.close()
+
+        # Test dictionary with invalid JSON values
+        class CustomObject:
+            pass
+        with self.assertRaises(TypeError):
+            metadata = {"invalid": CustomObject()}
+            _ = Serializer(byte_buffer, user_defined_metadata=metadata)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f49a098 and d492645.

📒 Files selected for processing (8)
  • clp_ffi_py/ir/native.pyi (2 hunks)
  • clp_ffi_py/utils.py (1 hunks)
  • src/clp_ffi_py/PyObjectCast.hpp (1 hunks)
  • src/clp_ffi_py/Py_utils.cpp (3 hunks)
  • src/clp_ffi_py/Py_utils.hpp (1 hunks)
  • src/clp_ffi_py/ir/native/PySerializer.cpp (6 hunks)
  • tests/test_ir/test_serder.py (3 hunks)
  • tests/test_ir/test_serializer.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_ir/test_serder.py
  • clp_ffi_py/ir/native.pyi
🧰 Additional context used
📓 Path-based instructions (4)
src/clp_ffi_py/PyObjectCast.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/Py_utils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/Py_utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/clp_ffi_py/ir/native/PySerializer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
src/clp_ffi_py/Py_utils.hpp (2)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#80
File: src/clp_ffi_py/Py_utils.cpp:80-87
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/clp_ffi_py/Py_utils.cpp`, for functions like `py_utils_serialize_dict_to_msgpack`, type checking is handled by the Python utilities, and we avoid adding redundant type checks in the C++ code.
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#83
File: src/clp_ffi_py/Py_utils.cpp:114-120
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In the `src/clp_ffi_py/Py_utils.cpp` file, for the `py_utils_parse_json_str` function, explicit UTF-8 validation of the input string is unnecessary because Python will throw exceptions if the input string is not valid UTF-8.
src/clp_ffi_py/Py_utils.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#80
File: src/clp_ffi_py/Py_utils.cpp:80-87
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/clp_ffi_py/Py_utils.cpp`, for functions like `py_utils_serialize_dict_to_msgpack`, type checking is handled by the Python utilities, and we avoid adding redundant type checks in the C++ code.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp311-win_amd64
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp312-manylinux_aarch64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_universal2
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
🔇 Additional comments (8)
src/clp_ffi_py/Py_utils.hpp (1)

48-55: LGTM!

The function declaration is well-documented and follows the existing patterns in the codebase.

src/clp_ffi_py/PyObjectCast.hpp (1)

136-136: LGTM!

The macro specialization for PyUnicodeObject is correctly added and follows the existing pattern.

src/clp_ffi_py/Py_utils.cpp (2)

19-19: LGTM!

The constant and static pointer declarations follow the existing patterns.

Also applies to: 26-26


73-80: LGTM!

The initialization code follows the existing patterns and includes proper error handling.

src/clp_ffi_py/ir/native/PySerializer.cpp (4)

Line range hint 37-51: LGTM! Clear and comprehensive documentation.

The documentation for the new user_defined_metadata parameter is well-written and includes all necessary details about the parameter type and validation requirements.


Line range hint 228-251: LGTM! Parameter parsing is correctly implemented.

The changes properly integrate the new optional parameter into the existing argument parsing infrastructure.


331-331: LGTM! Metadata correctly passed to serializer.

The optional metadata is properly passed to the serializer's create method.


290-329: Note regarding JSON object validation.

Previous review discussion indicated that JSON object type validation will be handled by the FFI core API rather than at this layer.

Let's verify the FFI core API validation:

✅ Verification successful

JSON validation delegation to FFI core API confirmed

The code correctly delegates JSON validation to the FFI core API through clp::ffi::ir_stream::Serializer. This aligns with the previous review discussion.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for JSON object validation in FFI core API
rg -A 5 'create.*user_defined_metadata' 

Length of output: 638


Script:

#!/bin/bash
# Search for ClpIrSerializer class and create method implementation
ast-grep --pattern 'class ClpIrSerializer {
  $$$
  static $_ create($_) {
    $$$
  }
  $$$
}'

Length of output: 97


Script:

#!/bin/bash
# Find files containing ClpIrSerializer definition
rg -l "class.*ClpIrSerializer" --type cpp --type hpp

Length of output: 82


Script:

#!/bin/bash
# Search for ClpIrSerializer class and create method
rg -A 10 "ClpIrSerializer" 

Length of output: 4822


Script:

#!/bin/bash
# Search for the core Serializer implementation
rg -A 10 "namespace.*ir_stream" --type cpp --type hpp
rg -A 10 "class.*Serializer.*ir_stream" --type cpp --type hpp

Length of output: 175

Comment on lines 51 to 62
def serialize_dict_to_json_str(dictionary: Dict[Any, Any]) -> str:
"""
Serializes the given dictionary into a JSON string.

:param dictionary: The dictionary to serialize.
:return: JSON string of serialized dictionary.
:raises: TypeError The given input is not a dictionary.
"""
if not isinstance(dictionary, dict):
raise TypeError("The type of the input object must be a dictionary.")
return json.dumps(dictionary)

@coderabbitai coderabbitai bot Jan 26, 2025

🛠️ Refactor suggestion

Add error handling for JSON serialization failures.

The function should handle potential JSON serialization errors that could occur with complex Python objects or circular references.

Apply this diff to add error handling:

 def serialize_dict_to_json_str(dictionary: Dict[Any, Any]) -> str:
     """
     Serializes the given dictionary into a JSON string.

     :param dictionary: The dictionary to serialize.
     :return: JSON string of serialized dictionary.
     :raises: TypeError The given input is not a dictionary.
+    :raises: TypeError The dictionary contains values that are not JSON serializable.
     """
     if not isinstance(dictionary, dict):
         raise TypeError("The type of the input object must be a dictionary.")
-    return json.dumps(dictionary)
+    try:
+        return json.dumps(dictionary)
+    except TypeError as e:
+        raise TypeError(f"The dictionary contains values that are not JSON serializable: {e}")

Member Author

Does re-throwing the same exception here make any difference?

@LinZhihao-723 You make a good point. Re-throwing the same TypeError doesn't add significant value here since json.dumps already provides a descriptive error message indicating which value couldn't be serialized. The extra try/except block would just add complexity without meaningful benefit. The original implementation letting json.dumps' TypeError propagate naturally is cleaner and sufficient.
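A quick stdlib check (not project code) supports this resolution: `json.dumps` already raises a `TypeError` whose message names the offending type, so wrapping it in a custom try/except would add little information.

```python
import json

# json.dumps raises a descriptive TypeError on its own for
# non-serializable values, such as a Python set.
try:
    json.dumps({"bad": {1, 2, 3}})  # a set is not JSON serializable
except TypeError as e:
    print(e)  # message already identifies the problem, e.g. it mentions "set"

# A serializable dict goes through unchanged:
print(json.dumps({"level": "INFO"}))
```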


✏️ Learnings added
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#133
File: clp_ffi_py/utils.py:51-62
Timestamp: 2025-01-26T10:21:10.011Z
Learning: For the serialize_dict_to_json_str function in clp_ffi_py/utils.py, let json.dumps' TypeError propagate naturally as it already provides sufficient error context, rather than catching and re-throwing with a custom message.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
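The design this thread settles on — constraining user-defined metadata to a dict that is serialized to a JSON string — round-trips as sketched below with the stdlib alone (the actual clp_ffi_py classes are not used here):

```python
import json

user_metadata = {"source": "webapp", "schema_version": 2}

# Serialization side: type-check the input, then emit a JSON string;
# any non-serializable value lets json.dumps' TypeError propagate.
if not isinstance(user_metadata, dict):
    raise TypeError("The type of the input object must be a dictionary.")
metadata_json = json.dumps(user_metadata)

# Deserialization side: parse the JSON string back into a dict.
recovered = json.loads(metadata_json)
assert recovered == user_metadata
print(recovered)
```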

junhaoliao
junhaoliao previously approved these changes Jan 26, 2025
@junhaoliao junhaoliao left a comment

the pr title is fine to be used as the final commit message

Co-authored-by: Junhao Liao <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (1)

183-215: LGTM! Consider future performance optimization.

The implementation is robust with proper error handling and type verification. As discussed in previous reviews, consider using pybind11_json in a future release for more efficient JSON conversions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d492645 and eb63020.

📒 Files selected for processing (3)
  • clp_ffi_py/utils.py (1 hunks)
  • src/clp (1 hunks)
  • src/clp_ffi_py/ir/native/PyDeserializer.cpp (5 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/clp
🚧 Files skipped from review as they are similar to previous changes (1)
  • clp_ffi_py/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (3)
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.hpp:246-248
Timestamp: 2024-11-23T05:54:37.767Z
Learning: In the `src/clp_ffi_py/ir/native` module, raw pointers with `gsl::owner` are intentionally used instead of `std::unique_ptr` to comply with CPython's memory management requirements.
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.cpp:218-242
Timestamp: 2024-11-23T05:55:57.106Z
Learning: In `src/clp_ffi_py/ir/native/PyDeserializer.cpp`, the `deserialize_log_event` method in the `PyDeserializer` class ensures proper loop termination and does not risk infinite loops, even when handling incomplete or corrupted streams.
Learnt from: LinZhihao-723
PR: y-scope/clp-ffi-py#94
File: src/clp_ffi_py/ir/native/PyDeserializer.cpp:206-208
Timestamp: 2024-11-23T05:55:08.019Z
Learning: In code interfacing with CPython (e.g., `src/clp_ffi_py/ir/native/PyDeserializer.cpp`), using raw pointers for memory management is acceptable due to CPython's memory management requirements.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: Build cp311-win_amd64
  • GitHub Check: Build cp310-win_amd64
  • GitHub Check: Build cp312-manylinux_aarch64
  • GitHub Check: Build cp311-musllinux_x86_64
  • GitHub Check: Build cp311-musllinux_i686
  • GitHub Check: Build cp311-musllinux_aarch64
  • GitHub Check: Build cp311-manylinux_x86_64
  • GitHub Check: Build cp311-manylinux_i686
  • GitHub Check: Build cp311-manylinux_aarch64
  • GitHub Check: Build cp310-musllinux_x86_64
  • GitHub Check: Build cp310-musllinux_i686
  • GitHub Check: Build cp310-musllinux_aarch64
  • GitHub Check: Build cp310-manylinux_x86_64
  • GitHub Check: Build cp310-manylinux_i686
  • GitHub Check: Build cp310-manylinux_aarch64
  • GitHub Check: Build cp311-macosx_universal2
  • GitHub Check: Build cp311-macosx_arm64
  • GitHub Check: Build cp310-macosx_x86_64
  • GitHub Check: Build cp310-macosx_universal2
  • GitHub Check: Build cp310-macosx_arm64
🔇 Additional comments (5)
src/clp_ffi_py/ir/native/PyDeserializer.cpp (5)

6-6: LGTM! Required headers added.

The new headers are necessary for string handling, protocol constants, and JSON processing.

Also applies to: 14-14, 19-19


75-89: LGTM! Clear and comprehensive documentation.

The documentation clearly specifies the return type and behaviour, aligning with the discussed approach of returning None for missing metadata.


90-90: LGTM! Method declaration follows conventions.

The declaration is consistent with the project's Python C API binding patterns.


104-107: LGTM! Method table entry is properly configured.

The entry correctly sets up the Python binding with no arguments and links to the documentation.


346-355: LGTM! Safe and efficient implementation.

The implementation safely checks for key existence before access and returns a pointer to avoid copying, aligning with CPython's memory management requirements.
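The check-before-access pattern described above, returning a missing-metadata sentinel instead of raising, can be sketched in Python; the key name and function below are hypothetical, not the project's actual API:

```python
import json

# Hypothetical key name for illustration only.
USER_DEFINED_METADATA_KEY = "user_defined_metadata"

def get_user_defined_metadata(stream_metadata_json: str):
    """Return the user-defined metadata object, or None if the key is absent."""
    metadata = json.loads(stream_metadata_json)
    # Check for key existence before access, mirroring the reviewed C++
    # code's approach of avoiding an exception for the missing-key case.
    if USER_DEFINED_METADATA_KEY not in metadata:
        return None
    return metadata[USER_DEFINED_METADATA_KEY]

print(get_user_defined_metadata('{"user_defined_metadata": {"app": "demo"}}'))
print(get_user_defined_metadata("{}"))  # absent key yields None
```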

@LinZhihao-723 LinZhihao-723 merged commit eb59339 into y-scope:main Jan 26, 2025
73 checks passed