Releases · cometkim/unicode-segmenter

16 May 03:16

github-actions

[email protected]

954dc8e

[email protected]

Minor Changes

f1a43ff: Clean up mixed use of takeCodePoint and String.prototype.codePointAt
- grapheme: Use String.prototype.codePointAt
- grapheme: Optimize character length checking, which also reduces the size a bit
- utils: Add isBMP and isSMP utils to check whether a code point is in the BMP (Basic Multilingual Plane) or SMP (Supplementary Multilingual Plane) range
- utils: Deprecate takeCodePoint and takeChar in favor of ES6 String.prototype.codePointAt and String.fromCodePoint
- utils: takeChar no longer depends on String.fromCodePoint internally
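
As a rough illustration of the built-ins these notes point to, the sketch below walks a string with String.prototype.codePointAt and advances by one or two UTF-16 code units depending on a BMP check. The names isBmpSketch and codePointsOf are hypothetical and are not the library's own helpers.

```javascript
// Hypothetical helper names; this is not the library's actual implementation.
// The BMP (Basic Multilingual Plane) covers U+0000..U+FFFF.
const isBmpSketch = (cp) => cp >= 0 && cp <= 0xffff;

// Collect the code points of a string using the built-in codePointAt.
function codePointsOf(input) {
  const result = [];
  let cursor = 0;
  while (cursor < input.length) {
    const cp = input.codePointAt(cursor);
    result.push(cp);
    // A BMP code point occupies one UTF-16 code unit, anything above occupies two.
    cursor += isBmpSketch(cp) ? 1 : 2;
  }
  return result;
}

const points = codePointsOf("a\u{1F600}");
// String.fromCodePoint is the built-in inverse of codePointAt.
const roundTrip = String.fromCodePoint(...points);
```

Relying on the built-ins keeps the hot loop free of custom decoding logic, which is presumably part of the motivation for the deprecation.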

12 May 02:40

github-actions

[email protected]

ca5b4e3

[email protected]

Patch Changes

03e121c: Optimize grapheme cluster boundary check

12 May 01:25

github-actions

[email protected]

fd729cf

[email protected]

Minor Changes

04455e0: Implement GB9c rule from Unicode® Standard Annex #29
f9d3dd1: Hide the internal fields of the Intl adapter to prevent auto-completion

07 May 04:01

github-actions

[email protected]

9f3074d

[email protected]

Minor Changes

06159a4: Fix ESM module resolution, and make ESM-first (still support CommonJS by condition)

20 Apr 12:57

github-actions

[email protected]

421d44d

[email protected]

Minor Changes

e2c9e1d: Optimize perf again 🔥

It can still get faster, so why not?

Through careful micro-optimizations, it achieved performance improvements of up to ~30% (404ns -> 310ns) in the previously used simple emoji joiner test.

It now uses a more realistic benchmark with various types of input text. In most cases, unicode-segmenter is 7~15 times faster than other competing libraries.

For example, here is a Tweet-like text that ChatGPT generated:

🚀 새로운 유니코드 분할기 라이브러리 'unicode-segmenter'를 소개합니다! 🔍 각종 언어의 문자를 정확하게 구분해주는 강력한 도구입니다. Check it out! 👉 [https://github.com/cometkim/unicode-segmenter] #Unicode #Programming 🌐

And the result:

cpu: Apple M1 Pro
runtime: node v21.7.1 (arm64-darwin)

                       time (avg)             (min … max)       p75       p99      p999
--------------------------------------------------------- -----------------------------
unicode-segmenter   7'850 ns/iter   (7'753 ns … 8'122 ns)  7'877 ns  8'079 ns  8'122 ns
Intl.Segmenter     60'581 ns/iter    (57'916 ns … 405 µs) 59'167 ns 66'458 ns    358 µs
graphemer          66'303 ns/iter    (64'708 ns … 287 µs) 65'500 ns 73'459 ns    206 µs
grapheme-splitter     146 µs/iter       (143 µs … 466 µs)    145 µs    157 µs    397 µs

summary
  unicode-segmenter
   7.72x faster than Intl.Segmenter
   8.45x faster than graphemer
   18.6x faster than grapheme-splitter

ab6787b: Make the Intl adapter's type definitions compatible with the original
f974448: - Rename searchGrapheme to searchGraphemeCategory, and deprecate the old name.
- Rename Segmenter definitions from grapheme module to GraphemeCategory.
- Remove the SearchResult<T> and GraphemeSearchResult definitions, which are identical to CategorizedUnicodeRange<T>.
- Improve JSDoc comments to be more informative.
dc62381: Add takeCodePoint util to avoid extra String.codePointAt()
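
To make the idea of a categorized range lookup concrete, here is a minimal sketch of a binary search over sorted code point ranges. It assumes each entry has the shape [from, to, category]; that shape, the sample table, and the name searchCategorySketch are assumptions for illustration, not the signature of searchGraphemeCategory.

```javascript
// Assumed entry shape: [from, to, category], sorted by `from` and non-overlapping.
const sampleTable = [
  [0x30, 0x39, "digit"],
  [0x41, 0x5a, "upper"],
  [0x61, 0x7a, "lower"],
];

// Binary search for the range containing `cp`; returns `fallback` when none matches.
function searchCategorySketch(cp, table, fallback) {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1;
    const [from, to, category] = table[mid];
    if (cp < from) hi = mid - 1;
    else if (cp > to) lo = mid + 1;
    else return category;
  }
  return fallback;
}

const categoryOfA = searchCategorySketch("A".codePointAt(0), sampleTable, "other");
const categoryOfSpace = searchCategorySketch(" ".codePointAt(0), sampleTable, "other");
```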

Patch Changes

3ea5a2d: Optimize initial parsing time by compacting tables into JSON

See https://v8.dev/blog/cost-of-javascript-2019#json
and https://youtu.be/ff4fgQxPaO0
16d2028: - Fix Intl.Segmenter adapter type definitions to be 100% compatible with tslib
- Implement Intl.Segmenter.prototype.resolvedOptions.
  Since the locale matcher is environment-specific,
  the adapter returns the input locale as-is, or falls back to en.
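
The locale behavior described for resolvedOptions can be sketched roughly as follows. SegmenterSketch is a hypothetical stand-in, not the library's adapter class, and the handling of array input and the default granularity are assumptions.

```javascript
// Hypothetical adapter sketch; not the library's actual class.
class SegmenterSketch {
  #locale;
  #granularity;

  constructor(locales, options = {}) {
    // Take the first requested locale as-is, with no locale matching.
    const first = Array.isArray(locales) ? locales[0] : locales;
    // Fall back to "en" when no usable locale is given.
    this.#locale = typeof first === "string" && first.length > 0 ? first : "en";
    this.#granularity = options.granularity ?? "grapheme";
  }

  resolvedOptions() {
    return { locale: this.#locale, granularity: this.#granularity };
  }
}

const koOptions = new SegmenterSketch("ko").resolvedOptions();
const defaultOptions = new SegmenterSketch().resolvedOptions();
```

Skipping locale matching keeps the result predictable across environments, at the cost of not canonicalizing the tag the way a native Intl.Segmenter might.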

18 Apr 04:50

github-actions

[email protected]

8189851

[email protected]

Patch Changes

e0b910d: Fix {Extend}+{Extended_Pictographic} sequence

Counterexample:
- '👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈' -> 3 graphemes
Reported from eslint/eslint#18359
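
One way to cross-check the expected count is to run the same sequence through the built-in Intl.Segmenter, assuming a runtime that ships it with full ICU data (recent Node.js versions do). The input is written with escapes so the zero-width joiners and the variation selector stay visible.

```javascript
// Three emoji sequences, spelled out as escapes:
const input =
  // woman + ZWJ + red hair
  "\u{1F469}\u200D\u{1F9B0}" +
  // woman + ZWJ + woman + ZWJ + boy + ZWJ + boy
  "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}" +
  // white flag + VS16 (an Extend character) + ZWJ + rainbow
  "\u{1F3F3}\uFE0F\u200D\u{1F308}";

// Count grapheme clusters with the built-in segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = [...segmenter.segment(input)].length;
```

The last sequence is the interesting one: the variation selector sits between the pictograph and the joiner, which is the {Extend}+{Extended_Pictographic} situation the fix refers to.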

18 Apr 03:36

github-actions

[email protected]

0dbaec7

[email protected]

Patch Changes

77af2ac: Fix CommonJS module resolutions

18 Apr 03:18

github-actions

[email protected]

0b622ee

[email protected]

Minor Changes

c74c6a0: Expose /utils entry with a helpful API
- takeChar(input, cursor): take a utf8 character from the given input at the cursor
c3ceaa5: Add countGrapheme utility
955814a: Expose some low-level APIs that might help other dependents

7592c3b: - New entries for Unicode's general and emoji properties

import {
  isLetter, // match w/ \p{L}
  isNumeric, // match w/ \p{N}
  isAlphabetic, // match w/ \p{Alphabetic}
  isAlphanumeric, // match w/ [\p{N}\p{Alphabetic}]
} from "unicode-segmenter/general";

import {
  isEmoji, // match w/ \p{Extended_Pictographic}
  isEmojiPresentation, // match w/ \p{Emoji_Presentation}
} from "unicode-segmenter/emoji";

The grapheme segmenter now yields the matched category in the _cat field.
This is useful when building custom matchers by category,

e.g. a custom emoji matcher:

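import {
  graphemeSegments,
  GraphemeCategory,
} from "unicode-segmenter/grapheme";

// Yield only the segments categorized as Extended_Pictographic.
function* matchEmoji(str) {
  for (const { index, segment, _cat } of graphemeSegments(str)) {
    if (_cat === GraphemeCategory.Extended_Pictographic) {
      yield { emoji: segment, index };
    }
  }
}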
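
For reference, the Unicode property escapes named in the import comments above can be tried directly with built-in regular expressions. This sketch only shows what each property matches; the constant names are hypothetical, and it operates on strings, whereas the library's predicates may take a different argument type.

```javascript
// Regex counterparts of the properties named in the import comments.
// Each pattern is anchored so it tests exactly one character.
const letterRe = /^\p{L}$/u;
const numericRe = /^\p{N}$/u;
const alphabeticRe = /^\p{Alphabetic}$/u;
const alphanumericRe = /^[\p{N}\p{Alphabetic}]$/u;
const pictographicRe = /^\p{Extended_Pictographic}$/u;
const emojiPresentationRe = /^\p{Emoji_Presentation}$/u;

const rocket = "\u{1F680}";
const copyright = "\u00A9";

const checks = {
  hangulIsLetter: letterRe.test("\uD55C"),
  sevenIsNumeric: numericRe.test("7"),
  sevenIsAlphanumeric: alphanumericRe.test("7"),
  aIsAlphabetic: alphabeticRe.test("a"),
  rocketIsPictographic: pictographicRe.test(rocket),
  rocketHasEmojiPresentation: emojiPresentationRe.test(rocket),
  // The copyright sign is pictographic but defaults to text presentation.
  copyrightHasEmojiPresentation: emojiPresentationRe.test(copyright),
};
```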

7592c3b: Add more low-level utilities
- isHighSurrogate: check whether a UTF-16 code unit is a high surrogate
- isLowSurrogate: check whether a UTF-16 code unit is a low surrogate
- surrogatePairToCodePoint: convert a given surrogate pair to a Unicode code point
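
The arithmetic behind surrogate utilities like these is standard UTF-16 and can be sketched as below. The function names carry a Sketch suffix to make clear they are illustrative and not the library's exports.

```javascript
// High (lead) surrogates occupy U+D800..U+DBFF.
const isHighSurrogateSketch = (cu) => cu >= 0xd800 && cu <= 0xdbff;

// Low (trail) surrogates occupy U+DC00..U+DFFF.
const isLowSurrogateSketch = (cu) => cu >= 0xdc00 && cu <= 0xdfff;

// Combine a surrogate pair into a code point in U+10000..U+10FFFF.
const pairToCodePointSketch = (hi, lo) =>
  ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;

const sample = "\u{1F600}";
const hiUnit = sample.charCodeAt(0);
const loUnit = sample.charCodeAt(1);
const combined = pairToCodePointSketch(hiUnit, loUnit);
```

The result should agree with what the built-in codePointAt returns for the same position.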

Patch Changes

7592c3b: Correct some type definitions
900f959: Optimize perf again
3db955b: Fix edge cases around ZWJ

16 Apr 16:07

github-actions

[email protected]

0909f1c

[email protected]

Minor Changes

9938499: Get 2x faster by optimizing the hot path, with a reduced bundle size as well

By casting Unicode chars to u32 in advance, all internal operations become 32-bit integer operations.

The previous version (v0.1.6) was
- 2.47x faster than Intl.Segmenter
- 2.68x faster than graphemer
- 4.95x faster than grapheme-splitter
Now it is
- 5.04x faster than Intl.Segmenter
- 5.52x faster than graphemer
- 9.83x faster than grapheme-splitter

Patch Changes

b6824b5: Mark sideEffects on the polyfill bundle
7c68863: Reduce bundle size a bit by inlining internal constants and removing unused internal state.
9938499: Reduce bundle size a bit more
f1c80b7: Publish sourcemaps

15 Apr 03:15

github-actions

[email protected]

3fa0edd

[email protected]

Patch Changes

18c7f44: Fix breaks on Unicode extended characters
