Releases: cometkim/unicode-segmenter
[email protected]
Minor Changes
- f1a43ff: Clean up mixed use of `takeCodePoint` and `String.prototype.codePointAt`
  - grapheme: Use `String.prototype.codePointAt`
  - grapheme: Optimize character length checking, and reduce the size a bit
  - utils: Add `isBMP` and `isSMP` utils to check whether a code point number is in the BMP (Basic Multilingual Plane) or SMP range
  - utils: Deprecate `takeCodePoint` and `takeChar` in favor of ES6 `String.prototype.codePointAt` and `String.fromCodePoint`
  - utils: `takeChar` no longer depends on `String.fromCodePoint` internally
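As a rough illustration of the migration away from the deprecated helpers, the sketch below uses only the ES6 built-ins named above. The `isBMP` and `isSMP` functions here are hypothetical stand-ins written for this example (assuming SMP means plane 1, U+10000 to U+1FFFF); they are not the library's actual implementation.

```javascript
// Hypothetical stand-ins for illustration only; the library's own utils may differ.
const isBMP = (cp) => cp >= 0 && cp <= 0xffff;
const isSMP = (cp) => cp >= 0x10000 && cp <= 0x1ffff; // assumption: plane 1 only

const input = "a\u{1F600}"; // "a" followed by an emoji outside the BMP

// Read code points with the ES6 built-in instead of a custom helper.
const first = input.codePointAt(0);
const second = input.codePointAt(1);

// Convert a code point back to a string with the ES6 built-in.
const char = String.fromCodePoint(second);

// A BMP code point occupies one UTF-16 code unit, anything above occupies two.
const width = isBMP(second) ? 1 : 2;
```

The `width` value is what a cursor would advance by when walking the string one code point at a time.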
[email protected]
Patch Changes
- 03e121c: Optimize grapheme cluster boundary check
[email protected]
[email protected]
Minor Changes
- 06159a4: Fix ESM module resolution, and make the package ESM-first (CommonJS is still supported by condition)
[email protected]
Minor Changes
- e2c9e1d: Optimize perf again 🔥
It can still get faster, so why not?
Through careful micro-optimizations, it achieves performance improvements of up to ~30% (404ns -> 310ns) in the previously used simple emoji joiner test.
It now uses a more realistic benchmark with various types of input text. In most cases, `unicode-segmenter` is 7~15 times faster than other competing libraries.
For example, here is a Tweet-like text generated by ChatGPT:
🚀 새로운 유니코드 분할기 라이브러리 \'unicode-segmenter\'를 소개합니다! 🔍 각종 언어의 문자를 정확하게 구분해주는 강력한 도구입니다. Check it out! 👉 [https://github.com/cometkim/unicode-segmenter] #Unicode #Programming 🌐
And the result then:
```
cpu: Apple M1 Pro
runtime: node v21.7.1 (arm64-darwin)

                     time (avg)             (min … max)        p75        p99       p999
----------------------------------------------------------------------------------------
unicode-segmenter    7'850 ns/iter   (7'753 ns … 8'122 ns)   7'877 ns   8'079 ns   8'122 ns
Intl.Segmenter      60'581 ns/iter    (57'916 ns … 405 µs)  59'167 ns  66'458 ns     358 µs
graphemer           66'303 ns/iter    (64'708 ns … 287 µs)  65'500 ns  73'459 ns     206 µs
grapheme-splitter      146 µs/iter       (143 µs … 466 µs)     145 µs     157 µs     397 µs

summary
  unicode-segmenter
    7.72x faster than Intl.Segmenter
    8.45x faster than graphemer
    18.6x faster than grapheme-splitter
```
- ab6787b: Make the Intl adapter's type definitions compatible with the original
- f974448:
  - Rename `searchGrapheme` to `searchGraphemeCategory`, and deprecate the old one.
  - Rename `Segmenter` definitions from the grapheme module to `GraphemeCategory`.
  - Remove the `SearchResult<T>` and `GraphemeSearchResult` definitions, which are identical to `CategorizedUnicodeRange<T>`.
  - Improve JSDoc comments to be more informative.
- dc62381: Add `takeCodePoint` util to avoid extra `String.codePointAt()` calls
Patch Changes
- 3ea5a2d: Optimize initial parsing time by compacting tables into JSON
  See https://v8.dev/blog/cost-of-javascript-2019#json and https://youtu.be/ff4fgQxPaO0
- 16d2028:
  - Fix `Intl.Segmenter` adapter type definitions to be 100% compatible with tslib
  - Implement `Intl.Segmenter.prototype.resolvedOptions`.
    Since the locale matcher is environment-specific, the adapter returns the input locale as-is, or falls back to `en`.
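For reference, this is how `resolvedOptions` behaves on the native `Intl.Segmenter`, which the adapter is meant to mirror. The sketch uses only the built-in API; per the note above, the adapter would return the requested locale unchanged instead of running a locale matcher.

```javascript
// Native Intl.Segmenter shown for comparison; the adapter is expected to expose the same shape.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const options = segmenter.resolvedOptions();

// options.locale is resolved by the runtime's locale matcher,
// options.granularity echoes the requested granularity.
console.log(options.locale, options.granularity);
```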
[email protected]
Patch Changes
- e0b910d: Fix `{Extend}` + `{Extended_Pictographic}` sequence
Counterexample:
- '👩🦰👩👩👦👦🏳️🌈' -> 3 graphemes
Reported from eslint/eslint#18359
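The expected count can be cross-checked against the native `Intl.Segmenter`. The sketch below assumes the three emoji in the counterexample are ZWJ sequences (a red-haired woman, a family, and a rainbow flag) and spells them out with explicit escapes, since the joiners are invisible in plain text.

```javascript
// Assumption: the counterexample consists of these three ZWJ sequences.
const ZWJ = "\u200D";
const redHairedWoman = "\u{1F469}" + ZWJ + "\u{1F9B0}";
const family = ["\u{1F469}", "\u{1F469}", "\u{1F466}", "\u{1F466}"].join(ZWJ);
// The rainbow flag is an Extended_Pictographic followed by an Extend (U+FE0F), then ZWJ.
const rainbowFlag = "\u{1F3F3}\uFE0F" + ZWJ + "\u{1F308}";

const text = redHairedWoman + family + rainbowFlag;

// Split into extended grapheme clusters with the built-in segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(text)].map((s) => s.segment);

console.log(graphemes.length);
```

The rainbow flag is the interesting case here, because a variation selector sits between the pictograph and the joiner.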
[email protected]
Patch Changes
- 77af2ac: Fix CommonJS module resolutions
[email protected]
Minor Changes
- c74c6a0: Expose `/utils` entry with a helpful API
  - `takeChar(input, cursor)`: take a Unicode character from the given input at the cursor position
- c3ceaa5: Add `countGrapheme` utility
- 955814a: Expose some low-level APIs that might help other dependents
- 7592c3b:
  - New entries for Unicode's general and emoji properties
```js
import {
  isLetter, // match w/ \p{L}
  isNumeric, // match w/ \p{N}
  isAlphabetic, // match w/ \p{Alphabetic}
  isAlphanumeric, // match w/ [\p{N}\p{Alphabetic}]
} from "unicode-segmenter/general";

import {
  isEmoji, // match w/ \p{Extended_Pictographic}
  isEmojiPresentation, // match w/ \p{Emoji_Presentation}
} from "unicode-segmenter/emoji";
```
  - Grapheme segmenter now yields the matched category in the `_cat` field.
    It is useful when building custom matchers by category, e.g. a custom emoji matcher:
```js
function* matchEmoji(str) {
  for (let { index, segment, input, _cat } of graphemeSegments(str)) {
    if (_cat === GraphemeCategory.Extended_Pictographic) {
      yield { emoji: segment, index };
    }
  }
}
```
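For comparison, a similar matcher can be approximated with built-ins only, using the `\p{...}` property escapes that the entries above are described as matching. This is a sketch under that assumption; `matchEmojiWithIntl` is a hypothetical name, and its results may differ from the library's category-based matcher in edge cases.

```javascript
// Built-in approximation; not the library's implementation.
function* matchEmojiWithIntl(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  for (const { segment, index } of segmenter.segment(str)) {
    // Check whether the grapheme cluster starts with an Extended_Pictographic code point.
    if (/^\p{Extended_Pictographic}/u.test(segment)) {
      yield { emoji: segment, index };
    }
  }
}

const found = [...matchEmojiWithIntl("hi \u{1F600}!")];
console.log(found);
```

Running a regular expression per segment is likely slower than comparing a precomputed category, which is presumably why exposing `_cat` is useful.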
- 7592c3b: Add more low-level utilities
  - `isHighSurrogate`: check if a UTF-16 code unit is a high surrogate
  - `isLowSurrogate`: check if a UTF-16 code unit is a low surrogate
  - `surrogatePairToCodePoint`: convert the given surrogate pair to a Unicode code point
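The utilities above correspond to standard UTF-16 arithmetic. The following is a minimal re-implementation for illustration; the function bodies and the `(hi, lo)` argument order are assumptions, not the library's source.

```javascript
// Hypothetical re-implementations based on the UTF-16 encoding rules.
const isHighSurrogate = (c) => c >= 0xd800 && c <= 0xdbff;
const isLowSurrogate = (c) => c >= 0xdc00 && c <= 0xdfff;

// Combine a high and low surrogate into the code point they encode.
const surrogatePairToCodePoint = (hi, lo) =>
  ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;

const s = "\u{1F600}";
const hi = s.charCodeAt(0); // first UTF-16 code unit
const lo = s.charCodeAt(1); // second UTF-16 code unit

const cp = surrogatePairToCodePoint(hi, lo);
console.log(cp.toString(16));
```

The result should agree with `s.codePointAt(0)`, which performs the same decoding natively.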
Patch Changes
[email protected]
Minor Changes
- 9938499: Get 2x faster by optimizing the hot path, with a reduced bundle size as well
By casting Unicode chars to u32 in advance, all internal operations become 32-bit integer operations.
The previous version (v0.1.6) was
- 2.47x faster than Intl.Segmenter
- 2.68x faster than graphemer
- 4.95x faster than grapheme-splitter
Now it is
- 5.04x faster than Intl.Segmenter
- 5.52x faster than graphemer
- 9.83x faster than grapheme-splitter
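To illustrate the general idea of working with 32-bit integers rather than strings, the sketch below reads a code point once and looks it up in a typed-array range table with integer comparisons only. The table contents and the `inRanges` helper are hypothetical and chosen for this example; the library's actual tables and lookup are not shown in these notes.

```javascript
// Hypothetical range table: sorted [start, end] pairs stored as 32-bit integers.
const ranges = new Uint32Array([
  0x0300, 0x036f, // Combining Diacritical Marks block
  0x1f600, 0x1f64f, // Emoticons block
]);

// Binary search over the pairs using integer comparisons only.
function inRanges(cp) {
  let lo = 0;
  let hi = ranges.length / 2 - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const start = ranges[mid * 2];
    const end = ranges[mid * 2 + 1];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return true;
  }
  return false;
}

// Convert the character to an unsigned 32-bit integer once, up front.
const cp = "\u{1F600}".codePointAt(0) >>> 0;
console.log(inRanges(cp));
```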
Patch Changes
[email protected]
Patch Changes
- 18c7f44: Fix breaks on Unicode extended characters