<regex>: Correct characters not matched by special character dot #5192

Open, wants to merge 1 commit into base: main

@muellerj2 (Contributor) commented Dec 16, 2024

This corrects the set of characters that the special character dot (`.`) does not match in a regular expression, as specified by the ECMAScript and POSIX standards, and aligns our treatment of `.` with libstdc++ and libc++.

  • Adds U+2028 Line Separator and U+2029 Paragraph Separator as characters not matched by `.` in a `wregex` in ECMAScript mode. See the definition of the semantics of `.` in Section 22.2.2.7 of ECMAScript 14, which removes the line terminators from the set of matched characters, and the list of line terminators in Section 12.3. (Note that this links to a newer standard, but the set of unmatched characters has not changed since ECMAScript 3. Furthermore, the C++ standard does not modify the interpretation of `.`.)
  • In all other modes, `.` now matches all characters except NUL. This is in accordance with Section 9.3.4 and Section 9.4.4 of the POSIX standard. (I considered whether a newline (LF) should be excluded in addition to or instead of NUL in `grep` or `egrep` mode, as that is what grep implementations tend to do. The POSIX standard only states that regular expressions cannot match LFs because of the way grep works; it does not explicitly modify the definition of `.` or of regular expressions in general, so it is ambiguous on this question. Since libstdc++ and libc++ only exclude NUL from the set of characters matched by `.` in `grep` and `egrep` mode, I decided to align the set of unmatched characters with them.)
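The two rules above can be summarized as a standalone predicate. This is only a sketch of the intended behavior as described in this PR, not the code in `<regex>`; `DotMode`, `dot_matches`, and `check_dot_rules` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Hypothetical enum for this sketch; it is not a type from <regex>.
// Posix stands for every non-ECMAScript grammar covered by the second rule.
enum class DotMode { ECMAScript, Posix };

// Returns true if `.` is expected to match code point `c` under the rules above.
constexpr bool dot_matches(char32_t c, DotMode mode) {
    if (mode == DotMode::ECMAScript) {
        // ECMAScript line terminators: LF, CR, U+2028, U+2029.
        return c != U'\n' && c != U'\r' && c != U'\u2028' && c != U'\u2029';
    }
    // All other modes: every character except NUL.
    return c != U'\0';
}

// A few spot checks of the rules as stated in the description.
inline void check_dot_rules() {
    assert(!dot_matches(U'\u2028', DotMode::ECMAScript)); // Line Separator
    assert(!dot_matches(U'\u2029', DotMode::ECMAScript)); // Paragraph Separator
    assert(dot_matches(U'\0', DotMode::ECMAScript));      // NUL is not a line terminator
    assert(dot_matches(U'\n', DotMode::Posix));           // LF is matched outside ECMAScript
    assert(!dot_matches(U'\0', DotMode::Posix));          // only NUL is excluded
}
```

Note that U+2028 and U+2029 are only relevant to wide-character regexes, since a single narrow `char` cannot hold either code point.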

Note: Whether NUL should be matched in POSIX regular expressions is the subject of LWG-3603.
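To see how a particular standard library treats these characters, one could probe it directly with a one-character subject. The helper names below (`library_dot_matches`, `report_dot_behavior`) are hypothetical, and no expected output is listed on purpose: the results for U+2028, U+2029, LF outside ECMAScript mode, and NUL depend on the implementation and its version.

```cpp
#include <cassert>
#include <cstdio>
#include <regex>
#include <string>

// Hypothetical helper: does the pattern "." match the one-character string {c}
// when compiled with the given grammar flag?
inline bool library_dot_matches(wchar_t c, std::regex_constants::syntax_option_type grammar) {
    const std::wregex re(L".", grammar);
    const std::wstring subject(1, c); // length 1 even when c is NUL
    return std::regex_match(subject, re);
}

// Prints what the library in use does for the characters discussed in this PR.
inline void report_dot_behavior() {
    namespace rc = std::regex_constants;
    // Sanity check: an ordinary letter should match in any grammar.
    assert(library_dot_matches(L'a', rc::ECMAScript));

    std::printf("ECMAScript, U+2028: %d\n", library_dot_matches(L'\u2028', rc::ECMAScript));
    std::printf("ECMAScript, U+2029: %d\n", library_dot_matches(L'\u2029', rc::ECMAScript));
    std::printf("basic, LF:          %d\n", library_dot_matches(L'\n', rc::basic));
    std::printf("grep, LF:           %d\n", library_dot_matches(L'\n', rc::grep));
    std::printf("basic, NUL:         %d\n", library_dot_matches(L'\0', rc::basic));
}
```

Calling `report_dot_behavior()` from `main` before and after this change would show which of these cases differ.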

@muellerj2 requested a review from a team as a code owner December 16, 2024 14:26
@CaseyCarter added the bug ("Something isn't working") label Dec 17, 2024
@StephanTLavavej self-assigned this Dec 17, 2024
@StephanTLavavej added the regex ("Everyone's favorite header") label Jan 8, 2025
Labels: bug, regex
Project status: Initial Review

3 participants