Treat UTF-16 surrogate pairs as single characters for string min/maxLength #88
The head ref may contain hidden characters: "\u{1F4A9}"
    }

    fn json_simple_string(&mut self) -> NodeRef {
-        self.lexeme(&format!("\"{}*\"", CHAR_REGEX))
+        self.lexeme("(?s:.*)", true)
Note that this change is orthogonal to the underlying PR -- @mmoskal if using the CHAR_REGEX
directly is marginally more performant, I can switch it back.
        )))
        Ok(self.lexeme(
            &format!(
                "(?s:.{{{},{}}})",
`derivre` seems smart enough to match `\uD83D\uDCA9` with `.`, so length is counted appropriately
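To make the length question concrete, here is a minimal sketch using only the Rust standard library. It shows what the format string in the diff expands to, and how counting Unicode scalar values differs from counting UTF-16 code units for U+1F4A9. The helper names (`length_regex`, `scalar_len`, `utf16_len`) are hypothetical and not part of this crate or of derivre.

```rust
/// Hypothetical helper mirroring the format string in the diff.
/// `{{` and `}}` are escaped braces, so (1, 3) yields "(?s:.{1,3})".
fn length_regex(min: usize, max: usize) -> String {
    format!("(?s:.{{{},{}}})", min, max)
}

/// Length in Unicode scalar values: one per code point such as U+1F4A9.
fn scalar_len(s: &str) -> usize {
    s.chars().count()
}

/// Length in UTF-16 code units: a surrogate pair counts as two.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}
```

With these, `scalar_len("\u{1F4A9}")` is 1 while `utf16_len("\u{1F4A9}")` is 2, which is the gap the PR title refers to.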
The JSON-quoted derivre strings do not allow
It's all the same as far as JSON goes (when you read JSON
Added comment here microsoft/derivre@6062cef
Some general notes (from you-know-who): UTF-8 in JSON, Surrogate Pairs in JSON
Oh and BTW the test will pass if you do
@mmoskal thanks for taking a close look at this. I have to admit that all of this encoding stuff is still very far from intuitive to me... RE: being self-consistent... I would, as a user, expect
to produce an identical constraint to
Does this imply that we should use my solution but change the JSON quoting semantics in derivre? Or something else?
A JSON object
Now, the model may want to generate
Our current unconstrained string regex allows for both, while derivre JSON-quoted only allows for
(The whole discussion is rather independent of surrogate pairs; it just so happens that the ASCII JSON encoding of U+1F4A9 is "\uD83D\uDCA9", which looks like two codepoints at first glance but is one.)
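As an illustration of the last point, the following standard-library sketch shows how the two `\uXXXX` escapes combine into the single code point U+1F4A9. The function names are hypothetical; only `std::char::decode_utf16` is an existing API, and nothing here is taken from derivre.

```rust
/// Combine a high and low surrogate into one code point, following the
/// UTF-16 rule: 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00).
/// Returns None if the inputs are not a valid surrogate pair.
fn combine_surrogates(hi: u16, lo: u16) -> Option<u32> {
    if (0xD800..=0xDBFF).contains(&hi) && (0xDC00..=0xDFFF).contains(&lo) {
        Some(0x10000 + (((hi as u32) - 0xD800) << 10) + ((lo as u32) - 0xDC00))
    } else {
        None
    }
}

/// Decode a sequence of UTF-16 code units (as they would appear in JSON
/// `\uXXXX` escapes) into a String. Returns None on an unpaired surrogate.
fn decode_units(units: &[u16]) -> Option<String> {
    std::char::decode_utf16(units.iter().copied())
        .collect::<Result<String, _>>()
        .ok()
}
```

Here `combine_surrogates(0xD83D, 0xDCA9)` gives `Some(0x1F4A9)`, and `decode_units(&[0xD83D, 0xDCA9])` produces a string containing exactly one `char`.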
maybe rename to "disallow surrogate pairs in strings with min/maxLength"
Makes the following test pass: