Releases: cometkim/unicode-segmenter

[email protected]

16 May 03:16
954dc8e

Minor Changes

  • f1a43ff: Clean up mixed use of takeCodePoint and String.prototype.codePointAt
    • grapheme: Use String.prototype.codePointAt
    • grapheme: Optimize character length checking, which also reduces the size a bit
    • utils: Add isBMP and isSMP utils to check whether a code point is in the BMP (Basic Multilingual Plane) or SMP (Supplementary Multilingual Plane) range
    • utils: Deprecate takeCodePoint and takeChar in favor of ES6 String.prototype.codePointAt and String.fromCodePoint
    • utils: takeChar no longer depends on String.fromCodePoint internally
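
As a rough illustration of the ES6 built-ins that replace the deprecated helpers, the sketch below walks a string one code point at a time. The isBMP function here is a hypothetical stand-in written for this example, not the library's own implementation, and codePoints is an illustrative name.

```javascript
// Hypothetical helper (assumption, not the library's code):
// a code point is in the BMP when it fits in 16 bits.
const isBMP = (cp) => cp >= 0 && cp <= 0xffff;

// Walk a string code point by code point using only ES6 built-ins.
function codePoints(input) {
  const result = [];
  let cursor = 0;
  while (cursor < input.length) {
    const cp = input.codePointAt(cursor);
    result.push(cp);
    // BMP code points occupy one UTF-16 code unit, all others occupy two.
    cursor += isBMP(cp) ? 1 : 2;
  }
  return result;
}

const cps = codePoints("a\u{1F600}b");
// String.fromCodePoint is the inverse operation.
const roundTrip = String.fromCodePoint(...cps);
```

Advancing the cursor by the code point's width is what makes a separate "take a character" helper unnecessary in most loops.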

[email protected]

12 May 02:40
ca5b4e3

Patch Changes

  • 03e121c: Optimize grapheme cluster boundary check

[email protected]

12 May 01:25
fd729cf

Minor Changes

  • 04455e0: Implement GB9c rule from Unicode® Standard Annex #29
  • f9d3dd1: Hide the internal fields of the Intl adapter to prevent auto-completion

[email protected]

07 May 04:01
9f3074d

Minor Changes

  • 06159a4: Fix ESM module resolution, and make the package ESM-first (CommonJS is still supported via conditional exports)

[email protected]

20 Apr 12:57
421d44d

Minor Changes

  • e2c9e1d: Optimize perf again 🔥

    It can still get faster, so why not?

    Careful micro-optimizations improved performance by up to ~30% (404ns -> 310ns) in the simple emoji joiner test used previously.

    The benchmark now uses more realistic input covering various types of text. In most cases, unicode-segmenter is 7~15 times faster than other competing libraries.

    For example, here is a Tweet-like text generated by ChatGPT:

    🚀 새로운 유니코드 분할기 라이브러리 \'unicode-segmenter\'를 소개합니다! 🔍 각종 언어의 문자를 정확하게 구분해주는 강력한 도구입니다. Check it out! 👉 [https://github.com/cometkim/unicode-segmenter] #Unicode #Programming 🌐
    

    And the result:

    cpu: Apple M1 Pro
    runtime: node v21.7.1 (arm64-darwin)
    
                           time (avg)             (min … max)       p75       p99      p999
    --------------------------------------------------------- -----------------------------
    unicode-segmenter   7'850 ns/iter   (7'753 ns … 8'122 ns)  7'877 ns  8'079 ns  8'122 ns
    Intl.Segmenter     60'581 ns/iter    (57'916 ns … 405 µs) 59'167 ns 66'458 ns    358 µs
    graphemer          66'303 ns/iter    (64'708 ns … 287 µs) 65'500 ns 73'459 ns    206 µs
    grapheme-splitter     146 µs/iter       (143 µs … 466 µs)    145 µs    157 µs    397 µs
    
    summary
      unicode-segmenter
       7.72x faster than Intl.Segmenter
       8.45x faster than graphemer
       18.6x faster than grapheme-splitter
    
  • ab6787b: Make the Intl adapter's type definitions compatible with the original

  • f974448:

    • Rename searchGrapheme to searchGraphemeCategory, and deprecate the old one.
    • Rename Segmenter definitions from grapheme module to GraphemeCategory.
    • Remove the SearchResult<T> and GraphemeSearchResult definitions, which are identical to CategorizedUnicodeRange<T>.
    • Improve JSDoc comments to be more informative.
  • dc62381: Add takeCodePoint util to avoid extra String.prototype.codePointAt() calls

[email protected]

18 Apr 04:50
8189851

Patch Changes

  • e0b910d: Fix {Extend}+{Extended_Pictographic} sequence

    Counterexample:

    • '👩‍🦰👩‍👩‍👦‍👦🏳️‍🌈' -> 3 graphemes

    Reported from eslint/eslint#18359
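
The expected count for this input can be cross-checked with the built-in Intl.Segmenter, as in the sketch below. It uses the platform API rather than unicode-segmenter so that it runs without installing anything, and it assumes a runtime that ships Intl.Segmenter (recent Node.js versions and modern browsers). The variable names are illustrative.

```javascript
// The input is built from escapes so the exact code points are visible.
const redHairWoman = "\u{1F469}\u200D\u{1F9B0}"; // woman + ZWJ + red hair
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"; // ZWJ family sequence
const rainbowFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}"; // flag + VS16 (Extend) + ZWJ + rainbow
const input = redHairWoman + family + rainbowFlag;

// Reference segmentation using the built-in API (not unicode-segmenter itself).
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(input)].map((s) => s.segment);
const count = graphemes.length;
```

The rainbow flag is the interesting part: the variation selector between the flag and the ZWJ is an Extend character, which is the {Extend}+{Extended_Pictographic} case this fix covers.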

[email protected]

18 Apr 03:36
0dbaec7

Patch Changes

  • 77af2ac: Fix CommonJS module resolution

[email protected]

18 Apr 03:18
0b622ee

Minor Changes

  • c74c6a0: Expose /utils entry with a helpful API

    • takeChar(input, cursor): take a Unicode character from the given input at the cursor position
  • c3ceaa5: Add countGrapheme utility

  • 955814a: Expose some low-level APIs that might help other dependents

  • 7592c3b: New entries for Unicode's general and emoji properties

    import {
      isLetter, // match w/ \p{L}
      isNumeric, // match w/ \p{N}
      isAlphabetic, // match w/ \p{Alphabetic}
      isAlphanumeric, // match w/ [\p{N}\p{Alphabetic}]
    } from "unicode-segmenter/general";
    
    import {
      isEmoji, // match w/ \p{Extended_Pictographic}
      isEmojiPresentation, // match w/ \p{Emoji_Presentation}
    } from "unicode-segmenter/emoji";
    • The grapheme segmenter now yields the matched category in the _cat field.
      This is useful when building custom matchers by category.

      e.g. a custom emoji matcher:

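      import { graphemeSegments, GraphemeCategory } from "unicode-segmenter/grapheme";

      // Yield only the segments whose category is Extended_Pictographic.
      function* matchEmoji(str) {
        for (const { index, segment, _cat } of graphemeSegments(str)) {
          if (_cat === GraphemeCategory.Extended_Pictographic) {
            yield { emoji: segment, index };
          }
        }
      }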
  • 7592c3b: Add more low-level utilities

    • isHighSurrogate: check whether a UTF-16 code unit is a high surrogate
    • isLowSurrogate: check whether a UTF-16 code unit is a low surrogate
    • surrogatePairToCodePoint: convert a given surrogate pair to a Unicode code point
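
To make the surrogate utilities concrete, here is a minimal sketch of what such helpers typically compute. The function bodies below are assumptions based on the standard UTF-16 encoding rules, not code copied from the library, so the shipped signatures and behavior may differ.

```javascript
// Assumed implementations based on the UTF-16 definition (not the library's code).
const isHighSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Combine the 10 payload bits of each surrogate and add the 0x10000 offset.
const surrogatePairToCodePoint = (hi, lo) =>
  ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;

const text = "\u{1F600}"; // one code point stored as two UTF-16 code units
const hi = text.charCodeAt(0);
const lo = text.charCodeAt(1);
const codePoint =
  isHighSurrogate(hi) && isLowSurrogate(lo)
    ? surrogatePairToCodePoint(hi, lo)
    : hi;
```

The result agrees with what String.prototype.codePointAt returns for the same position, which is why later releases lean on the built-in instead.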

[email protected]

16 Apr 16:07
0909f1c

Minor Changes

  • 9938499: Get 2x faster by optimizing the hot path, with a reduced bundle size as well

    By casting Unicode chars to u32 in advance, all internal operations become 32-bit integer operations.

    The previous version (v0.1.6) was

    • 2.47x faster than Intl.Segmenter
    • 2.68x faster than graphemer
    • 4.95x faster than grapheme-splitter

    Now it is

    • 5.04x faster than Intl.Segmenter
    • 5.52x faster than graphemer
    • 9.83x faster than grapheme-splitter

Patch Changes

  • b6824b5: Mark sideEffects on the polyfill bundle
  • 7c68863: Reduce bundle size a bit by inlining internal constants and removing unused internal state.
  • 9938499: Reduce bundle size a bit more
  • f1c80b7: Publish sourcemaps

[email protected]

15 Apr 03:15
3fa0edd

Patch Changes

  • 18c7f44: Fix breaks on Unicode extended characters