refactor: Unify epsilon, positive start, positive end, and negative transitions into spontaneous transitions. #76

Open
wants to merge 512 commits into
base: main

Conversation

SharafMohamed
Contributor

@SharafMohamed SharafMohamed commented Jan 13, 2025

References

  • Depends on PR#72.
  • To review in parallel with PR#72, diff against PR#72 locally. In the repo run:
git fetch upstream pull/72/head:pr-72
git fetch upstream pull/76/head:pr-76
git diff pr-72 pr-76

Description

  • Previously we distinguished between epsilon transitions, positive start transitions, positive end transitions, and negative transitions. This distinction provides no theoretical or practical benefit.
  • Combining the different types of spontaneous transitions into a single type reduces code bloat, makes the code more intuitive, and allows for easier implementation of the following determinization PR.
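
As a rough illustration of the idea (not the code in this PR), a single transition type can carry an optional tag operation, so that an epsilon transition is simply the case with no operation and no tags. All names below (`TagOperation`, `Transition`, `State`, `make_epsilon`, `make_positive`, `make_negative`, `describe`) are hypothetical and chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names for illustration; not the PR's actual definitions.
enum class TagOperation { None, Set, Negate };

struct State;

// One transition type covers every transition that consumes no input.
struct Transition {
    TagOperation op{TagOperation::None};
    std::vector<std::uint32_t> tag_ids;
    State const* dest{nullptr};
};

struct State {
    std::vector<Transition> spontaneous;
};

// Epsilon transition: no tag operation and no tags.
inline auto make_epsilon(State const* dest) -> Transition {
    return Transition{TagOperation::None, {}, dest};
}

// Positive (start or end) transition: sets the given tags.
inline auto make_positive(std::vector<std::uint32_t> tags, State const* dest) -> Transition {
    return Transition{TagOperation::Set, std::move(tags), dest};
}

// Negative transition: negates the given tags.
inline auto make_negative(std::vector<std::uint32_t> tags, State const* dest) -> Transition {
    return Transition{TagOperation::Negate, std::move(tags), dest};
}

// Renders a transition's operation and tags as text, e.g. "set[0,1]".
inline auto describe(Transition const& t) -> std::string {
    if (TagOperation::None == t.op) {
        return "epsilon";
    }
    std::string out{TagOperation::Set == t.op ? "set[" : "negate["};
    for (std::size_t i{0}; i < t.tag_ids.size(); ++i) {
        if (0 != i) {
            out += ",";
        }
        out += std::to_string(t.tag_ids[i]);
    }
    out += "]";
    return out;
}
```

With one type, code that walks spontaneous transitions (for example an epsilon-closure computation) needs a single loop rather than one loop per transition kind.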

Validation performed

  • Modified transition tests to match refactor.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new Capture class to replace the existing Tag class.
    • Enhanced lexer functionality with improved capture and tag handling.
    • Added new methods for managing capture groups and identifiers.
    • Added a new SpontaneousTransition class for NFA transitions.
    • Added a new test file for validating Capture class functionality.
  • Refactoring

    • Replaced tag-based transitions with capture-based transitions.
    • Updated type handling for register, symbol, and tag identifiers.
    • Streamlined NFA and regex AST implementation.
  • Documentation

    • Updated comments and terminology from "tags" to "captures".
    • Improved code clarity and type safety.
  • Testing

    • Enhanced lexer and NFA test coverage.
    • Introduced structured initialization and validation methods for tests.

These changes improve the log surgeon library's type management and capture group handling while maintaining existing functionality.

coderabbitai bot commented Jan 13, 2025

Walkthrough

This pull request introduces a comprehensive refactoring of the log surgeon's finite automata implementation, focusing on replacing the concept of "tags" with "captures". The changes span multiple files across the project, including CMakeLists.txt, source files, and test cases. The primary goal appears to be enhancing the clarity and functionality of capture group handling in regex and NFA processing, with modifications to type definitions, method signatures, and internal data structures.

Changes

File | Change Summary
--- | ---
CMakeLists.txt | Removed Tag.hpp, added Capture.hpp, restored PrefixTree.cpp, PrefixTree.hpp, and RegexAST.hpp
Lexer.hpp | Added type aliases, new methods for capture and tag handling, updated member variables
Lexer.tpp | Modified method signatures, enhanced rule and generation logic
LexicalRule.hpp | Added get_captures() method
SchemaParser.cpp | Replaced Tag with Capture, updated header inclusions
finite_automata/Capture.hpp | Renamed Tag class to Capture, updated preprocessor guards
finite_automata/Dfa.hpp | Added RegisterHandler member
finite_automata/Nfa.hpp | Introduced UniqueIdGenerator, updated methods to use captures
finite_automata/NfaState.hpp | Updated constructors and methods to use tag_id_t
finite_automata/RegisterHandler.hpp | Added register_id_t type alias
tests/CMakeLists.txt | Updated source file lists
tests/test-capture.cpp | New test file for Capture class
tests/test-lexer.cpp | Added initialization and scanning test functions

Sequence Diagram

sequenceDiagram
    participant Lexer
    participant Nfa
    participant RegexAST
    participant Capture

    Lexer->>Nfa: Create NFA with rules
    Nfa->>RegexAST: Extract captures
    RegexAST->>Capture: Generate unique capture IDs
    Nfa-->>Lexer: NFA with capture mappings

Possibly Related PRs

Suggested Reviewers

  • LinZhihao-723


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (7)
src/log_surgeon/Lexer.tpp (4)

362-365: Pass symbol_id_t by value in add_rule method

Since symbol_id_t is likely a small type (e.g., an integer), passing it by value is more efficient than passing by const reference.

Apply this diff to update the parameter:

- void Lexer<TypedNfaState, TypedDfaState>::add_rule(
-         symbol_id_t const& var_id,
+ void Lexer<TypedNfaState, TypedDfaState>::add_rule(
+         symbol_id_t var_id,
          std::unique_ptr<finite_automata::RegexAST<TypedNfaState>> rule)

369-374: Pass symbol_id_t by value in get_rule method

Note that get_rule already takes symbol_id_t by value; the diff below only drops the top-level const, which does not change the function's signature for callers.

Apply this diff to update the parameter:

- auto Lexer<TypedNfaState, TypedDfaState>::get_rule(symbol_id_t const var_id)
+ auto Lexer<TypedNfaState, TypedDfaState>::get_rule(symbol_id_t var_id)

381-395: Improve exception message for duplicate capture names

Including the duplicate capture name in the exception message enhances debugging by providing specific information about the error.

Apply this diff to enhance the exception message:

-             throw std::invalid_argument("`m_rules` contains capture names that are not unique.");
+             throw std::invalid_argument("Duplicate capture name detected: " + capture_name);
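
For context, a minimal sketch of how such a check might surface the offending name, assuming the capture names are available as plain strings. The helper `check_unique_capture_names` is hypothetical and is not the logic in Lexer.tpp.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper for illustration; not the lexer's actual implementation.
inline auto check_unique_capture_names(std::vector<std::string> const& names) -> void {
    std::unordered_set<std::string> seen;
    for (auto const& name : names) {
        // insert() returns a pair whose second member is false if the element already existed.
        if (false == seen.insert(name).second) {
            throw std::invalid_argument("Duplicate capture name detected: " + name);
        }
    }
}
```

The exception reports the first repeated name encountered, which is usually enough to locate the conflicting rule in a schema.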

404-404: Offer assistance for DFA capture handling

The TODO comment notes that the DFA currently ignores captures, which might lead to incorrect lexing of patterns with capture groups.

Would you like assistance in updating the DFA implementation to properly handle captures? I can help develop a solution and open a new GitHub issue to track this task.

src/log_surgeon/Lexer.hpp (1)

131-169: Add documentation for the new getter methods.

While the methods are well-structured, they would benefit from documentation explaining:

  • The purpose of each method
  • The meaning of nullopt returns
  • Any preconditions or postconditions
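
A sketch of the kind of documentation meant here, shown on a hypothetical lookup class (`CaptureIdMap` and its methods are invented for this example and do not come from Lexer.hpp):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical class for illustration only.
class CaptureIdMap {
public:
    /**
     * Registers `id` for `name`, replacing any earlier registration.
     */
    auto add(std::string name, std::uint32_t id) -> void {
        m_name_to_id.insert_or_assign(std::move(name), id);
    }

    /**
     * @param name The capture name to look up.
     * @return The ID registered for `name`.
     * @return std::nullopt if `name` was never registered.
     */
    [[nodiscard]] auto get_id(std::string const& name) const -> std::optional<std::uint32_t> {
        auto const it = m_name_to_id.find(name);
        if (m_name_to_id.end() == it) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, std::uint32_t> m_name_to_id;
};
```

Stating the nullopt case in the comment tells callers that a missing entry is an expected outcome rather than an error.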
tests/test-nfa.cpp (1)

47-47: Consider using std::move for rules.

The removal of std::move could lead to unnecessary copying. Consider restoring it:

-    ByteNfa const nfa{rules};
+    ByteNfa const nfa{std::move(rules)};
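
Whether restoring std::move helps depends on the constructor's signature. The comment on Nfa.hpp line 43 in this same review says the constructor takes rules by const reference; under that assumption, std::move at the call site has no effect, as the sketch below shows with a hypothetical stand-in class (`Holder` is not part of the codebase).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a class whose constructor takes its input by const reference.
class Holder {
public:
    // An rvalue argument binds to the const reference, and the member is copy-constructed.
    explicit Holder(std::vector<std::string> const& items) : m_items(items) {}

    [[nodiscard]] auto size() const -> std::size_t { return m_items.size(); }

private:
    std::vector<std::string> m_items;
};
```

With this signature, `Holder const h{std::move(rules)};` still copies, and `rules` keeps its contents afterwards. A move would only happen if the constructor took the vector by value or by rvalue reference.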
tests/test-lexer.cpp (1)

296-350: Comprehensive test coverage with room for expansion.

The test cases effectively cover both basic lexer functionality and capture groups. However, there's a TODO comment about adding tests for register-related functionality.

Would you like me to help implement the register-related tests once the determinization is implemented?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e08f728 and 0cc6c24.

📒 Files selected for processing (18)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (3 hunks)
  • src/log_surgeon/LexicalRule.hpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (4 hunks)
  • src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-capture.cpp (1 hunks)
  • tests/test-lexer.cpp (5 hunks)
  • tests/test-nfa.cpp (4 hunks)
  • tests/test-tag.cpp (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/test-tag.cpp
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
🧰 Additional context used
📓 Path-based instructions (13)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-nfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Capture.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-capture.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (7)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
src/log_surgeon/SchemaParser.cpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
src/log_surgeon/finite_automata/Capture.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
src/log_surgeon/finite_automata/NfaState.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
src/log_surgeon/Lexer.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/TaggedTransition.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build (ubuntu-latest, Debug)
🔇 Additional comments (27)
src/log_surgeon/finite_automata/Nfa.hpp (1)

43-43: Good practice: Pass rules by const reference

Modifying the constructor to accept rules as a const& parameter avoids unnecessary copies and improves performance.

src/log_surgeon/finite_automata/RegexAST.hpp (4)

86-100: Efficient management of subtree positive captures

The addition of methods to get, set, and add m_subtree_positive_captures enhances the clarity and maintainability of capture tracking within the AST nodes.


650-660: Validation of non-null parameters in RegexASTCapture constructor

Ensuring that group_regex_ast and capture are not null by throwing std::invalid_argument improves robustness and aligns with previously learned best practices regarding non-null requirements for m_capture in RegexASTCapture.

This approach reflects the learning that m_tag (now m_capture) must always be non-null, as noted in past experiences.


788-790: Correct accumulation of positive captures in RegexASTOr

The constructor correctly accumulates subtree positive captures from both left and right operands, ensuring accurate tracking of captures in alternation expressions.


817-819: Accurate merging of captures in RegexASTCat

Combining subtree positive captures from both operands in concatenation expressions maintains the integrity of capture groups throughout the AST.

src/log_surgeon/finite_automata/Capture.hpp (2)

1-2: LGTM! Class and header guards renamed consistently.

The renaming from Tag to Capture has been consistently applied across the class name and header guards, which improves code clarity by better reflecting its purpose.

Also applies to: 9-9, 20-20


11-11: LGTM! Constructor implementation is correct.

The constructor correctly moves the name parameter into the member variable, which is an efficient approach for string handling.
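
The pattern referred to here is commonly written as a by-value parameter that is moved into the member. A minimal sketch with a hypothetical class (`NamedCapture` is a stand-in; the actual Capture class may differ):

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for illustration only.
class NamedCapture {
public:
    // Taking the string by value lets callers pass an lvalue (one copy) or an rvalue (no copy);
    // std::move then transfers the parameter's buffer into the member.
    explicit NamedCapture(std::string name) : m_name{std::move(name)} {}

    [[nodiscard]] auto get_name() const -> std::string_view { return m_name; }

private:
    std::string m_name;
};
```

Passing an lvalue leaves the caller's string untouched, while passing a temporary avoids the copy entirely.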

tests/test-capture.cpp (1)

7-34: LGTM! Comprehensive test coverage with well-organized test cases.

The test suite effectively covers:

  • Basic functionality
  • Edge cases (empty names)
  • Special character handling
  • Copy and move semantics
src/log_surgeon/LexicalRule.hpp (1)

26-28: LGTM! Well-implemented getter with appropriate const correctness.

The method correctly:

  • Uses [[nodiscard]] to prevent accidental value discarding
  • Returns const pointers to prevent modification of captures
  • Delegates to the appropriate regex method
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)

5-5: LGTM! Appropriate include for fixed-width integer types.

The addition of <cstdint> is necessary for using uint32_t in the type alias.


21-21: LGTM! Type alias improves code clarity and maintainability.

Using register_id_t instead of raw uint32_t enhances code readability and makes future type changes easier.

src/log_surgeon/finite_automata/TaggedTransition.hpp (3)

14-15: LGTM! Type alias improves code clarity.

The introduction of tag_id_t as std::uint32_t enhances code readability and ensures consistent tag ID representation across the codebase.


Line range hint 23-46: LGTM! Improved type safety with value types.

The refactoring from Tag pointers to tag_id_t values:

  • Reduces memory management complexity
  • Improves type safety
  • Aligns with the PR objective of simplifying transition handling

Line range hint 55-78: LGTM! Consistent refactoring approach.

The changes to NegativeTaggedTransition mirror those in PositiveTaggedTransition, maintaining consistency in the transition handling approach.

src/log_surgeon/finite_automata/Dfa.hpp (2)

Line range hint 5-45: LGTM! Enhanced register handling capabilities.

The addition of RegisterHandler and supporting includes improves the DFA's register management capabilities.


82-82: LGTM! Explicit pointer initialization improves safety.

Initializing the state pointer to nullptr prevents potential undefined behavior.

src/log_surgeon/Lexer.hpp (3)

27-29: LGTM! Well-defined type system.

The introduction of clear type aliases enhances code readability and maintains consistency with the codebase's type system.


59-61: LGTM! Clear error documentation.

The updated documentation clearly specifies the exception condition for duplicate capture names.


197-199: LGTM! Efficient ID mapping implementation.

The use of unordered_map provides efficient lookups and aligns with the transition to an ID-based system.

tests/test-nfa.cpp (1)

Line range hint 50-138: LGTM! Test expectations updated correctly.

The test expectations have been properly updated to reflect the new tag ID system while maintaining comprehensive coverage of different transition types.

src/log_surgeon/finite_automata/NfaState.hpp (3)

35-40: LGTM! Constructor changes improve efficiency.

The switch from Tag const* to tag_id_t for constructor parameters is a good change that:

  • Reduces pointer indirection
  • Aligns with the PR's goal of unifying transitions
  • Simplifies memory management

54-56: LGTM! Method signature change maintains consistency.

The update to use tag_id_t in add_positive_tagged_start_transition maintains consistency with the constructor changes.


Line range hint 205-251: Consider documenting the epsilon transition behaviour.

The TODO comment "currently treat tagged transitions as epsilon transitions" suggests this might be temporary. Consider:

  1. Documenting this behaviour in the class documentation
  2. Evaluating if this should remain the long-term approach
tests/test-lexer.cpp (1)

58-72: Well-structured helper functions with clear documentation.

The new helper functions initialize_lexer and test_scanning_input improve test maintainability and readability.

src/log_surgeon/SchemaParser.cpp (1)

12-12: LGTM! Consistent transition from Tag to Capture.

The changes properly implement the transition from Tag to Capture, maintaining consistency across the codebase.

Also applies to: 170-170

tests/CMakeLists.txt (2)

5-7: LGTM! Source files properly updated.

The addition of Capture.hpp and related headers maintains proper build dependencies.


27-34: LGTM! Test sources well-organized.

The addition of test-capture.cpp and the improved formatting enhance maintainability.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

Line range hint 89-850: Format the code using clang-format.

Multiple code formatting violations were detected by the linter. Please run clang-format on the file to ensure it adheres to the project's style guide.

♻️ Duplicate comments (1)
src/log_surgeon/finite_automata/Nfa.hpp (1)

20-28: ⚠️ Potential issue

Ensure thread safety of UniqueIdGenerator

The current_id counter is not thread-safe. If instances of Nfa are accessed by multiple threads, consider using std::atomic<uint32_t> for current_id to prevent data races.

 private:
-    uint32_t current_id;
+    std::atomic<uint32_t> current_id;
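
A minimal sketch of an atomic counter along the lines suggested above. The class name `AtomicIdGenerator` and its method are hypothetical; the PR's UniqueIdGenerator may be structured differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch for illustration only.
class AtomicIdGenerator {
public:
    // fetch_add returns the value held before the increment, so IDs start at 0.
    // Relaxed ordering suffices when the counter only needs to hand out distinct values.
    [[nodiscard]] auto generate_id() -> std::uint32_t {
        return m_current_id.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint32_t> m_current_id{0};
};
```

One trade-off to weigh: std::atomic is neither copyable nor movable, so a class holding this generator by value loses its implicit copy and move operations. If an Nfa is only ever built on a single thread, a plain counter may be the simpler choice.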
🧹 Nitpick comments (6)
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (4)

15-19: Add documentation for TransitionOperation enum values

Each enum value's purpose should be documented to improve code maintainability.

Add documentation like this:

 enum class TransitionOperation {
+    /// No operation to perform on tags
     None,
+    /// Set the specified tags as active
     SetTags,
+    /// Set the specified tags as inactive
     NegateTags
 };

62-64: Make member variables const

Since these members are never modified after construction, they should be marked as const.

Apply this diff:

-    TransitionOperation m_transition_op;
-    std::vector<tag_id_t> m_tag_ids;
-    TypedNfaState const* m_dest_state;
+    const TransitionOperation m_transition_op;
+    const std::vector<tag_id_t> m_tag_ids;
+    TypedNfaState const* const m_dest_state;
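
Before applying this, it may be worth checking how the transitions are stored and manipulated, because const data members remove the implicit assignment operators. The sketch below uses two hypothetical structs to show the effect with type traits.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical types for illustration only.
struct MutableMembers {
    std::vector<std::uint32_t> tag_ids;
};

struct ConstMembers {
    std::vector<std::uint32_t> const tag_ids;
};

// Non-const members: assignment works as usual.
static_assert(std::is_move_assignable_v<MutableMembers>);

// Const members: copy and move assignment are implicitly deleted.
static_assert(false == std::is_copy_assignable_v<ConstMembers>);
static_assert(false == std::is_move_assignable_v<ConstMembers>);

// Construction still works, but "moving" a const member falls back to copying it.
static_assert(std::is_copy_constructible_v<ConstMembers>);
static_assert(std::is_move_constructible_v<ConstMembers>);
```

Appending such objects to a std::vector still works, but operations that assign elements, such as `std::vector::erase` or `std::sort`, would no longer compile.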

52-59: Enhance serialize() implementation

The current implementation has two potential improvements:

  1. Include the transition operation in the output for better debugging and visualization
  2. Handle empty tag_ids vector case specially

Consider this implementation:

     [[nodiscard]] auto serialize(std::unordered_map<TypedNfaState const*, uint32_t> const& state_ids
     ) const -> std::optional<std::string> {
         auto const state_id_it = state_ids.find(m_dest_state);
         if (state_id_it == state_ids.end()) {
             return std::nullopt;
         }
-        return fmt::format("{}[{}]", state_id_it->second, fmt::join(m_tag_ids, ","));
+        auto const op_str = m_transition_op == TransitionOperation::None ? ""
+                         : m_transition_op == TransitionOperation::SetTags ? "+"
+                         : "-";
+        auto const tags_str = m_tag_ids.empty() ? ""
+                           : fmt::format("[{}]", fmt::join(m_tag_ids, ","));
+        return fmt::format("{}{}{}", state_id_it->second, op_str, tags_str);
     }

26-26: Add comparison operators for container operations

Consider adding operator== and operator<=> to enable using this class in ordered containers and algorithms.

Add these operators:

    auto operator<=>(const SpontaneousTransition&) const = default;
    bool operator==(const SpontaneousTransition&) const = default;
src/log_surgeon/finite_automata/NfaState.hpp (1)

48-49: Add parameter validation

The method should validate that dest_state is not null before adding it to the transitions.

     auto add_spontaneous_transition(NfaState* dest_state) -> void {
+        if (nullptr == dest_state) {
+            return;
+        }
         m_spontaneous_transitions.emplace_back(dest_state);
     }
src/log_surgeon/finite_automata/Nfa.hpp (1)

110-111: Consider implementing scoped capture groups

The TODO comment suggests that unique naming across capture groups limits use cases. Consider implementing a scoping mechanism to allow reuse of capture names within different scopes.

Would you like me to propose a design for implementing scoped capture groups?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cc6c24 and 6653007.

📒 Files selected for processing (9)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/LexicalRule.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (0 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-nfa.cpp (5 hunks)
💤 Files with no reviewable changes (1)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/log_surgeon/LexicalRule.hpp
  • tests/test-nfa.cpp
  • CMakeLists.txt
  • tests/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (4)
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (3)
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🪛 GitHub Actions: lint
src/log_surgeon/finite_automata/Nfa.hpp

[error] 224-224: Code formatting violation detected. Line needs to be formatted using clang-format.

src/log_surgeon/finite_automata/NfaState.hpp

[error] 198-204: Code formatting violations detected. Multiple lines need to be formatted using clang-format.

src/log_surgeon/finite_automata/RegexAST.hpp

[error] 89-850: Multiple code formatting violations detected throughout the file. File needs to be formatted using clang-format.

🔇 Additional comments (9)
src/log_surgeon/finite_automata/RegexAST.hpp (7)

23-23: LGTM! Documentation accurately reflects the tag-to-capture transition.

The include directive and class documentation have been properly updated to reflect the architectural change from tags to captures.

Also applies to: 33-37


85-99: LGTM! Member functions properly handle capture-based operations.

The implementation correctly manages capture-related operations, and the condition on line 116 follows the coding guideline of using false == <expression>.

Also applies to: 103-104, 112-120


132-148: LGTM! Serialization methods effectively handle capture formatting.

The implementation makes good use of modern C++ features like ranges and format for serializing captures.


Line range hint 632-658: LGTM! RegexASTCapture implementation maintains non-null capture invariant.

The implementation correctly enforces the non-null requirement for captures, which aligns with the previously established invariant from PR #48.

Also applies to: 659-662, 670-671


784-788: LGTM! RegexASTOr correctly handles captures in alternation.

The implementation properly manages captures and their negations in alternation patterns.

Also applies to: 794-795


815-817: LGTM! RegexASTCat properly manages captures in concatenation.

The implementation correctly handles capture propagation in concatenated patterns.

Also applies to: 825-825, 827-827


850-851: LGTM! RegexASTMultiplication effectively manages captures in repetition patterns.

The implementation properly handles captures in various repetition scenarios (zero, finite, and infinite).

Also applies to: 861-885

src/log_surgeon/finite_automata/NfaState.hpp (2)

171-175: Address TODO comment regarding tag modifications

The TODO comment indicates that only transitions without tag modifications are handled. This limitation should be addressed to ensure proper functionality.

Would you like me to help implement the handling of transitions that modify tags?


198-204: ⚠️ Potential issue

Fix code formatting

The code formatting violates the project's style guidelines. Apply proper formatting to these lines.

-        auto const optional_serialized_transition
-                = spontaneous_transition.serialize(state_ids);
-        if(false == optional_serialized_transition.has_value()) {
+        auto const optional_serialized_transition = spontaneous_transition.serialize(state_ids);
+        if (false == optional_serialized_transition.has_value()) {

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: lint

[error] 198-204: Code formatting violations detected. Multiple lines need to be formatted using clang-format.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6653007 and 571877e.

📒 Files selected for processing (5)
  • src/log_surgeon/finite_automata/Nfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • tests/test-nfa.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test-nfa.cpp
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp
🧰 Additional context used
📓 Path-based instructions (3)
src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
🔇 Additional comments (12)
src/log_surgeon/finite_automata/RegexAST.hpp (4)

23-23: LGTM! Documentation and includes updated to reflect the transition from tags to captures.

The changes accurately reflect the architectural shift from tag-based to capture-based implementation.

Also applies to: 33-37


85-102: LGTM! Base class changes maintain functionality while transitioning to captures.

The changes consistently replace tag-based operations with capture-based ones, maintaining the same functionality and following coding guidelines.

Also applies to: 110-118, 130-146, 151-153


630-632: LGTM! RegexASTCapture changes maintain the non-null invariant.

The changes correctly transition from tags to captures while maintaining the critical invariant that the capture pointer must never be null, as enforced by the constructor's null checks.

Also applies to: 641-658, 666-667, 707-707, 716-716
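A minimal sketch of the constructor-enforced non-null invariant described above. `Capture` and `CaptureNode` here are hypothetical, simplified stand-ins rather than the PR's actual `Capture` and `RegexASTCapture` classes, and the choice of `std::invalid_argument` is an assumption made for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the PR's `Capture` class.
class Capture {
public:
    explicit Capture(std::string name) : m_name{std::move(name)} {}

    [[nodiscard]] auto get_name() const -> std::string const& { return m_name; }

private:
    std::string m_name;
};

// Hypothetical stand-in for `RegexASTCapture`; only the invariant is modelled.
class CaptureNode {
public:
    explicit CaptureNode(std::unique_ptr<Capture> capture) : m_capture{std::move(capture)} {
        // Reject null once, at construction, so no other member has to re-check.
        if (nullptr == m_capture) {
            throw std::invalid_argument("Capture cannot be null");
        }
    }

    // Safe to dereference: the constructor rejects null captures.
    [[nodiscard]] auto get_capture_name() const -> std::string const& {
        return m_capture->get_name();
    }

private:
    std::unique_ptr<Capture> m_capture;
};
```

Checking in the constructor means every later dereference can rely on the invariant instead of repeating the null check.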


780-783: LGTM! Derived classes updated consistently.

The changes systematically replace tag operations with capture operations across all derived classes. The updated documentation in RegexASTCapture::add_to_nfa provides an excellent visualization of the NFA structure.

Also applies to: 789-790, 810-811, 819-819, 821-821, 854-878, 906-926, 932-932, 936-936, 942-949

src/log_surgeon/finite_automata/NfaState.hpp (4)

16-21: LGTM! Clean header organization and type aliases.

The addition of SpontaneousTransition.hpp and the type aliases for ByteNfaState and Utf8NfaState improve code readability.


35-40: LGTM! Clear constructor implementation.

The constructor properly initializes the spontaneous transitions with the provided parameters.


48-57: LGTM! Well-designed method overloads.

The two overloads of add_spontaneous_transition provide good flexibility for adding transitions with or without operations and tags.


196-202: LGTM! Robust error handling.

The serialization implementation properly handles failures using std::optional.
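The pattern praised here can be sketched roughly as follows. `State`, `Transition`, and `serialize_all` are hypothetical, simplified types and names, not the PR's classes; the sketch only assumes that a per-transition `serialize` returns `std::optional` and that the caller propagates failure, following the project's `false == <expression>` guideline.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified state and transition types (not the PR's classes).
struct State {};

struct Transition {
    State const* dest_state;

    // Returns `std::nullopt` if the destination state has no assigned ID.
    [[nodiscard]] auto serialize(std::unordered_map<State const*, uint32_t> const& state_ids
    ) const -> std::optional<std::string> {
        auto const it = state_ids.find(dest_state);
        if (state_ids.end() == it) {
            return std::nullopt;
        }
        return std::to_string(it->second);
    }
};

// Serializes all transitions as "id,id,..."; fails if any single transition fails.
[[nodiscard]] inline auto serialize_all(
        std::vector<Transition> const& transitions,
        std::unordered_map<State const*, uint32_t> const& state_ids
) -> std::optional<std::string> {
    std::string result;
    for (auto const& transition : transitions) {
        auto const optional_serialized_transition = transition.serialize(state_ids);
        if (false == optional_serialized_transition.has_value()) {
            return std::nullopt;
        }
        if (false == result.empty()) {
            result += ",";
        }
        result += optional_serialized_transition.value();
    }
    return result;
}
```

Returning `std::nullopt` from the outer function as soon as one transition fails keeps a partially serialized string from ever reaching the caller.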

src/log_surgeon/finite_automata/Nfa.hpp (4)

126-136: LGTM! Efficient tag management.

The method efficiently handles tag creation and caching using the unique ID generator.


146-161: LGTM! Clear negative capture handling.

The method properly manages negative transitions and tag creation for each capture.


166-179: LGTM! Well-structured capture state creation.

The method properly creates and links start and end states with appropriate tag operations.


20-28: 🛠️ Refactor suggestion

Consider making UniqueIdGenerator thread-safe.

The current implementation uses a non-atomic increment, which would be a data race if generate_id() were ever called concurrently from multiple threads.

 class UniqueIdGenerator {
 public:
     UniqueIdGenerator() : current_id{0} {}

-    [[nodiscard]] auto generate_id() -> uint32_t { return current_id++; }
+    [[nodiscard]] auto generate_id() -> uint32_t {
+        return std::atomic_fetch_add(&current_id, 1u);
+    }

 private:
-    uint32_t current_id;
+    std::atomic<uint32_t> current_id;
 };

Likely invalid or redundant comment.
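For reference, a self-contained sketch of the atomic variant suggested in the diff above. It uses the member function `fetch_add` instead of the free function `std::atomic_fetch_add`; the `m_current_id` member name and the relaxed memory ordering are assumptions of this sketch, not part of the PR.

```cpp
#include <atomic>
#include <cstdint>

class UniqueIdGenerator {
public:
    // `fetch_add` returns the value held before the increment, so IDs start at 0.
    // Relaxed ordering is enough here because only uniqueness of the returned
    // values matters, not ordering relative to other memory operations.
    [[nodiscard]] auto generate_id() -> uint32_t {
        return m_current_id.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<uint32_t> m_current_id{0};
};
```

One trade-off to weigh: `std::atomic` is neither copyable nor movable, so any class that holds this generator by value loses its implicitly generated copy and move operations.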


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/Nfa.hpp (1)

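166-180: Consider spelling out the vector's element type.

The implementation is correct. Writing the element type explicitly for end_tag documents intent instead of relying on class template argument deduction; both forms construct the same vector, so this is a readability change rather than a performance one.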

-            std::vector{end_tag},
+            std::vector<tag_id_t>{end_tag},
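A short sketch of what the two spellings in this diff do. With C++17 class template argument deduction, `std::vector{end_tag}` already deduces `std::vector<tag_id_t>`, so both forms build the same one-element vector. The `tag_id_t` alias below is assumed to be `uint32_t`, and the two helper functions are hypothetical.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

using tag_id_t = uint32_t;  // Assumed alias; the PR defines its own `tag_id_t`.

// Deduced form: the element type comes from the type of `end_tag`.
[[nodiscard]] inline auto make_deduced(tag_id_t end_tag) -> std::vector<tag_id_t> {
    static_assert(std::is_same_v<decltype(std::vector{end_tag}), std::vector<tag_id_t>>);
    return std::vector{end_tag};
}

// Explicit form: the element type is written out at the call site.
[[nodiscard]] inline auto make_explicit(tag_id_t end_tag) -> std::vector<tag_id_t> {
    return std::vector<tag_id_t>{end_tag};
}
```

The explicit form mainly protects against silent type changes: if `end_tag` were later declared with a different integer type, the deduced form would follow it, while the explicit form would keep producing `std::vector<tag_id_t>`.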
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 571877e and d5acb49.

📒 Files selected for processing (4)
  • src/log_surgeon/finite_automata/Nfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • tests/test-nfa.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp
  • tests/test-nfa.cpp
🧰 Additional context used
📓 Path-based instructions (2)
src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (8)
src/log_surgeon/finite_automata/NfaState.hpp (4)

35-40: LGTM! Efficient constructor implementation.

The constructor correctly uses std::move for tag_ids parameter and initializes m_spontaneous_transitions with the provided parameters.


48-57: LGTM! Well-designed transition methods.

Both overloads of add_spontaneous_transition are well-implemented, with proper parameter handling and efficient use of std::move.
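A rough sketch of the two-overload design described above. The `TransitionOperation` enumerators, the `SpontaneousTransition` layout, and the accessor name are assumptions made for illustration; the PR's actual definitions may differ.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using tag_id_t = uint32_t;

// Hypothetical operation set; the PR's `TransitionOperation` may differ.
enum class TransitionOperation : uint8_t {
    None,
    SetTags,
    NegateTags
};

class NfaState;

// Simplified stand-in for the PR's `SpontaneousTransition` class.
struct SpontaneousTransition {
    TransitionOperation operation;
    std::vector<tag_id_t> tag_ids;
    NfaState const* dest_state;
};

class NfaState {
public:
    // Plain spontaneous transition: no operation and no tags.
    auto add_spontaneous_transition(NfaState const* dest_state) -> void {
        m_spontaneous_transitions.push_back(
                SpontaneousTransition{TransitionOperation::None, {}, dest_state}
        );
    }

    // Spontaneous transition that applies `operation` to `tag_ids`.
    auto add_spontaneous_transition(
            TransitionOperation operation,
            std::vector<tag_id_t> tag_ids,
            NfaState const* dest_state
    ) -> void {
        // `tag_ids` is taken by value and moved, so callers can pass temporaries cheaply.
        m_spontaneous_transitions.push_back(
                SpontaneousTransition{operation, std::move(tag_ids), dest_state}
        );
    }

    [[nodiscard]] auto get_spontaneous_transitions() const
            -> std::vector<SpontaneousTransition> const& {
        return m_spontaneous_transitions;
    }

private:
    std::vector<SpontaneousTransition> m_spontaneous_transitions;
};
```

With a single transition type, a plain epsilon transition is just the case where the operation is `None` and the tag list is empty.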


171-174: Consider transition operations in epsilon closure.

The epsilon_closure method currently adds all spontaneous transitions to the closure set without checking their TransitionOperation. This might lead to incorrect closure calculation if some transitions modify tags.

Consider filtering transitions based on their operation type or documenting the intended behaviour.
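One way the filtering option could look, as a sketch under stated assumptions: the types below are hypothetical simplifications, and the `follow_tag_operations` flag is an invented parameter used only to contrast the two behaviours. Whether tag-modifying transitions should be followed is a design decision for the PR author; this sketch does not settle it.

```cpp
#include <cstdint>
#include <set>
#include <stack>
#include <vector>

// Hypothetical operation set; the PR's `TransitionOperation` may differ.
enum class TransitionOperation : uint8_t {
    None,
    SetTags,
    NegateTags
};

struct State;

struct SpontaneousTransition {
    TransitionOperation operation;
    State const* dest_state;
};

struct State {
    std::vector<SpontaneousTransition> spontaneous_transitions;
};

// Computes the set of states reachable from `start` via spontaneous transitions.
// If `follow_tag_operations` is false, transitions that modify tags are not followed.
[[nodiscard]] inline auto epsilon_closure(State const* start, bool follow_tag_operations)
        -> std::set<State const*> {
    std::set<State const*> closure;
    std::stack<State const*> pending;
    pending.push(start);
    while (false == pending.empty()) {
        auto const* state = pending.top();
        pending.pop();
        if (false == closure.insert(state).second) {
            continue;  // Already visited.
        }
        for (auto const& transition : state->spontaneous_transitions) {
            if (false == follow_tag_operations
                && TransitionOperation::None != transition.operation)
            {
                continue;
            }
            pending.push(transition.dest_state);
        }
    }
    return closure;
}
```

The visited check on insertion keeps the traversal from looping on cyclic spontaneous transitions.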


193-207: LGTM! Robust serialization implementation.

The serialize method properly handles serialization failures and maintains a clear format for the output.

src/log_surgeon/finite_automata/Nfa.hpp (4)

126-136: LGTM! Clean implementation of tag management.

The method efficiently handles both existing and new tags, with good use of structured bindings and clear control flow.
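The lookup-or-create flow described above might be sketched like this. `TagRegistry`, `get_or_create_tag_ids`, and the map layout are hypothetical names chosen for illustration; only the general idea (reuse cached tag IDs, otherwise draw fresh ones from the ID generator) is taken from the review comment.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

using tag_id_t = uint32_t;

struct Capture {};  // Hypothetical stand-in for the PR's `Capture` class.

class UniqueIdGenerator {
public:
    [[nodiscard]] auto generate_id() -> uint32_t { return m_current_id++; }

private:
    uint32_t m_current_id{0};
};

class TagRegistry {
public:
    // Returns the (start, end) tag IDs for `capture`, creating them on first use.
    [[nodiscard]] auto get_or_create_tag_ids(Capture const* capture)
            -> std::pair<tag_id_t, tag_id_t> {
        // `try_emplace` inserts a placeholder only if the capture is not yet known.
        auto const [it, is_new] = m_capture_to_tag_ids.try_emplace(capture);
        if (is_new) {
            auto const start_tag = m_id_generator.generate_id();
            auto const end_tag = m_id_generator.generate_id();
            it->second = std::make_pair(start_tag, end_tag);
        }
        return it->second;
    }

private:
    UniqueIdGenerator m_id_generator;
    std::unordered_map<Capture const*, std::pair<tag_id_t, tag_id_t>> m_capture_to_tag_ids;
};
```

Using `try_emplace` with a structured binding covers both the existing and the new case with a single hash lookup.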


146-162: LGTM! Well-structured negative capture handling.

The method properly collects both start and end tags for each capture and efficiently creates the new state with the NegateTags operation.
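As an illustration of the tag-collection step described above, here is a sketch that gathers the start and end tag of every capture into a single `NegateTags` transition. All types and the `make_negative_transition` helper are hypothetical; the PR's method also creates and links a new NFA state, which this sketch omits.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using tag_id_t = uint32_t;

struct Capture {};  // Hypothetical stand-in for the PR's `Capture` class.

// Hypothetical operation set; the PR's `TransitionOperation` may differ.
enum class TransitionOperation : uint8_t {
    None,
    SetTags,
    NegateTags
};

struct SpontaneousTransition {
    TransitionOperation operation;
    std::vector<tag_id_t> tag_ids;
};

// Builds one transition that negates the start and end tags of every capture.
// Throws `std::out_of_range` (from `at`) if a capture has no registered tags.
[[nodiscard]] inline auto make_negative_transition(
        std::vector<Capture const*> const& captures,
        std::unordered_map<Capture const*, std::pair<tag_id_t, tag_id_t>> const&
                capture_to_tag_ids
) -> SpontaneousTransition {
    std::vector<tag_id_t> tag_ids;
    tag_ids.reserve(2 * captures.size());
    for (auto const* capture : captures) {
        auto const& [start_tag, end_tag] = capture_to_tag_ids.at(capture);
        tag_ids.push_back(start_tag);
        tag_ids.push_back(end_tag);
    }
    return SpontaneousTransition{TransitionOperation::NegateTags, std::move(tag_ids)};
}
```

Collecting all tags first means a single transition can negate every capture of an unmatched alternative, rather than chaining one transition per tag.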


Line range hint 216-233: LGTM! Robust serialization with proper error handling.

The method correctly handles serialization failures and maintains a clear format for the output.


20-28: ⚠️ Potential issue

Make UniqueIdGenerator thread-safe.

The current implementation uses a non-atomic increment operation which isn't thread-safe. Consider using std::atomic<uint32_t> for current_id to prevent data races in multi-threaded contexts.

-    uint32_t current_id;
+    std::atomic<uint32_t> current_id;

Likely invalid or redundant comment.
