properties: add "ambiwidth" property for ambiguous East Asian Width #270
If this is font-dependent, it doesn't seem like something you can infer from the codepoint alone? I'm a little confused about how people would use this new property in practice.
Sure, but it is font-dependent either way. Currently utf8proc represents all (non-zero-width) ambiguous-width chars as single-width, which is a fine first approximation but not guaranteed to be correct either. Knowing which chars are considered ambiguous allows apps to treat them more carefully, i.e. in a TUI you could reposition the cursor after each such codepoint to make sure the TUI and terminal emulator cursors are in sync regardless of the actual width in the user's font. More specifically, this was motivated by ongoing work in neovim to migrate all Unicode table lookups to use utf8proc, and ambiguous EAW is something we need to know in order not to regress functionality. Whether these chars are treated as single- or double-width is configurable as an option, and regardless we do the workaround described above to handle discrepancies between fonts.
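To make the usage described above concrete, here is a minimal sketch in C of how a consumer might combine a default width with an "ambiguous" flag. Everything in it is an assumption for illustration: `is_ambiguous_width` is a hypothetical stand-in for a library lookup and covers only a few ranges listed as class A in EastAsianWidth.txt, while `cell_width` and `needs_cursor_resync` are made-up helper names, not part of utf8proc or neovim.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a library lookup: reports whether a codepoint
 * has East Asian Width class "A" (ambiguous). Only a few sample ranges are
 * listed here; a real table is much larger. */
static bool is_ambiguous_width(int32_t cp) {
    return cp == 0x00A7                     /* SECTION SIGN */
        || (cp >= 0x0391 && cp <= 0x03A1)   /* GREEK CAPITAL ALPHA..RHO */
        || (cp >= 0x2500 && cp <= 0x254B);  /* box drawing characters */
}

/* Resolve the number of cells to reserve for a codepoint.
 * default_width is the width reported without ambiguity handling;
 * ambi_double models a user option that treats ambiguous chars as wide. */
static int cell_width(int32_t cp, int default_width, bool ambi_double) {
    if (default_width > 0 && ambi_double && is_ambiguous_width(cp)) {
        return 2;
    }
    return default_width;
}

/* Whether a TUI should explicitly reposition the cursor after emitting
 * this codepoint, because the terminal's font may disagree on its width. */
static bool needs_cursor_resync(int32_t cp, int default_width) {
    return default_width > 0 && is_ambiguous_width(cp);
}
```

With the option off, an ambiguous codepoint keeps its default width of 1; with it on, the same codepoint reserves two cells. In both cases the resync predicate flags it, so the cursor can be repositioned no matter how the font renders it.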
This is an example of how this property will be used in neovim: neovim/neovim#30042.
@stevengj any input? This is a bit of a blocker for us.
Seems fine to me; can you add an accessor function to the API? e.g.
…idth Some characters have their width defined as "Ambiguous" in UAX#11. These are typically rendered as single-width by modern monospace fonts, and utf8proc correctly returns charwidth==1 for these. However, some applications might need to support older CJK fonts where characters that were two-byte in legacy encodings were rendered as double-width. An example of this is the 'ambiwidth' option of vim and neovim, which supports rendering in terminals using such wideness rules. Add an 'ambiguous_width' property to utf8proc_property_t for such characters.
done.
Note that Unicode 16 looks like it is scheduled to be released on September 10, so it might be good to hold off on a new release for a couple of weeks until we can update the Unicode tables.
3 months ping. could we release? 👀 @stevengj |
Some characters have their width defined as "Ambiguous" in UAX#11. These are typically rendered as single-width by modern monospace fonts, and utf8proc correctly returns charwidth==1 for these.
However some applications might need to support older CJK fonts where two-byte characters in legacy encodings were rendered as double-width. An example of this is the 'ambiwidth' option of vim and neovim which supports rendering in terminals using such wideness rules.
Add an 'ambiwidth' property to utf8proc_property_t for such characters, by using a previously unused padding bit.
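As a rough illustration of the "previously unused padding bit" idea, the sketch below shows two hypothetical bitfield layouts. These are not the real `utf8proc_property_t` definition; the struct names, field widths, and the `make_props` helper are assumptions chosen only to show that carving a one-bit flag out of padding leaves the struct size unchanged on typical compilers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "before" layout: 2 bits of width plus 6 unused padding bits. */
typedef struct {
    unsigned charwidth : 2;
    unsigned pad : 6;
} props_before_t;

/* Hypothetical "after" layout: one padding bit is repurposed as a flag,
 * so the total number of bits in use stays the same. */
typedef struct {
    unsigned charwidth : 2;
    unsigned ambiguous_width : 1;
    unsigned pad : 5;
} props_after_t;

/* Build a record in the new layout (illustrative helper, not a library API). */
static props_after_t make_props(unsigned width, bool ambiguous) {
    props_after_t p;
    p.charwidth = width & 3u;           /* keep only the 2 bits that fit */
    p.ambiguous_width = ambiguous ? 1u : 0u;
    p.pad = 0u;
    return p;
}
```

Because the flag lives in bits that were already reserved, existing readers of `charwidth` see the same values, and the size of each table entry should not change.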
alternatives
- Set charwidth==3 for such characters (which are not zero-width); that value is presently unused. This would be too much of a breaking change for existing consumers, I think.
- Return the full set of EAW classes (W, F, N, H, Na, A). This could be more future-proof if some consumers need this info, but would require more space.