Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat UTF-16 surrogate pairs as single characters for string min/maxLength #88

Merged
merged 3 commits into from
Dec 13, 2024

Conversation

hudson-ai
Copy link
Collaborator

Makes the following test pass:

  {
      "description": "minLength validation",
      "schema": {
          "$schema": "https://json-schema.org/draft/2020-12/schema",
          "minLength": 2
      },
      "tests": [
          {
              "description": "one grapheme is not long enough",
              "data": "\uD83D\uDCA9",
              "valid": false
          }
      ]
  }

@hudson-ai hudson-ai requested a review from mmoskal December 10, 2024 23:25
}

fn json_simple_string(&mut self) -> NodeRef {
self.lexeme(&format!("\"{}*\"", CHAR_REGEX))
self.lexeme("(?s:.*)", true)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this change is orthogonal to the underlying PR -- @mmoskal if using the CHAR_REGEX directly is marginally more performant, I can switch it back.

)))
Ok(self.lexeme(
&format!(
"(?s:.{{{},{}}})",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

derivre seems smart enough to match \uD83D\uDCA9 with ., so length is counted appropriately

@mmoskal
Copy link
Collaborator

mmoskal commented Dec 11, 2024

The JSON-quoted derivre strings do not allow \uxxxx escapes except for \x00xx. The assumption is that everything else will be emitted as UTF-8.

It's all the same as far as JSON goes (when you read JSON {"x":"ł"} or {"x":"\u0142"} you get exactly the same object in memory), but I don't know if we want to let the model output the general \uxxxx sequences in non-regex case. I would say not just to be consistent. If we do allow it, we would also need to check if the surrogate pairs with \xD8xx are output correctly (that is the high surrogate only ever occurs before low surrogate and vice versa).

Added comment here microsoft/derivre@6062cef

Some general notes (from you-know-who):

UTF-8 in JSON
• JSON strings must be encoded in UTF-8 by default.
• Any Unicode character can appear directly in a JSON string if the encoding supports it, such as "\u1234" for U+1234.
• Non-ASCII characters are often escaped for compatibility using Unicode escapes (\u).

Surrogate Pairs in JSON
• Surrogate pairs are used to represent characters outside the Basic Multilingual Plane (BMP) (U+10000 to U+10FFFF) in UTF-16.
• A surrogate pair consists of two code units:
• High surrogate: \uD800 to \uDBFF.
• Low surrogate: \uDC00 to \uDFFF.
• These pairs combine to encode a single Unicode code point.
• E.g., the code point U+1F600 (😀) is encoded as \uD83D\uDE00.

@mmoskal
Copy link
Collaborator

mmoskal commented Dec 11, 2024

Oh and BTW the test will pass if do json.dumps(..., ensure_ascii=False)

@hudson-ai
Copy link
Collaborator Author

hudson-ai commented Dec 12, 2024

@mmoskal thanks for taking a close look at this. I have to admit that all of this encoding stuff is still very far from intuitive to me...

RE: being self-consistent... I would, as a user, expect

{"minLength": X, "maxLength": Y}

to produce an identical constraint to

{"pattern": "(?s:.{X,Y})"}

Does this imply that we should use my solution but change the JSON quoting semantics in derivre? Or something else?

@mmoskal
Copy link
Collaborator

mmoskal commented Dec 13, 2024

A JSON object {"x":"ł"} can be also encoded as {"x":"\u0142"}. It is the same JSON object when read into memory, just a different serialization format (in Python it's the ensure_ascii option, which defaults to False (\u0142); most other libraries/languages AFAIK default to ł as it is shorter and easier to read). In some sense this is similar to { "x" : "\u0142" } (with spaces) also being the same object.

Now, the model may want to generate ł or \u0142; I don't know which one it likes more, but my money would be on ł, Who knows if the test data was passed through Python with default settings at some point, but I suspect not.

Our current unconstrained string regex allows for both, while derivre JSON-quoted only allows for ł. I would say let's merge this and if we find that models like \u0142, we'll change derivre.

(The whole discussion is rather independent of surrogate pairs, it just so happens ASCII JSON encoding of U+1F4A9 is "\uD83D\uDCA9" which looks like two codepoints at first glance but is one).

Copy link
Collaborator

@mmoskal mmoskal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename to "disallow surrogate pairs in strings with min/maxLength"

@hudson-ai hudson-ai merged commit 011fe3b into guidance-ai:main Dec 13, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants